Interesting to see this article again, since it was the trigger for starting a homelab for me. After realizing cloud services were putting a major dent in my pocket just to get a lousy startup idea off the ground, I started to wonder whether there was a cheapest way. (I'm not cheap, but I am very frugal.)
Nowadays internet speeds are great for self-hosting. I have business-line internet at home with ~1 Gb up and down! I bought a couple of 6-7 year old enterprise Dell servers (2x 12-core Xeon, 128 GB RAM each) and no longer pay any cloud provider... I'm also hosting two backend solutions for mobile apps with decent traffic for friends' startups!
The learning experience has been tremendous! It has actually gotten a lot better and easier with new solutions coming out for homelabs. Get started with Proxmox clusters and go from there...
After this talk [0] I had several very interesting conversations with
media folks about the real cost and advantages of "cloud".
One thing that came up is development. Modern devops culture is quite
a good thing, and what's lovely about "cloud" - as in the ability to
quickly buy compute and storage capability - is that ideas you would
have tinkered with in on-prem labs (or across private sites) for
months can be imagined and prototyped in hours.
I'm a big advocate of rapid prototyping as a _huge_ business lever,
because the ability to try out ideas quickly, to easily reconfigure
things, is the key for time to market. You can quickly see if
something is going to fly or not.
And that's where the advantage ends.
After that, it's all downhill. Asymmetry. Lock-in and portability.
Trust and privacy issues. Security perimeters. Unpredictable costs....
So the way forward is to render unto Caesar only the things that are
Caesar's.... in other words, take the advantages of "cloud" when it
suits you, and then get the hell out of Dodge.
What is ongoing from that conversation is media companies being
interested in strategic planning to build, and even share, their own
distributed computing resources to pull back to once a technology is
off the ground.
Someone even mentioned that it's time for a European Cloud initiative.
Yeah, it's interesting that devops has become a lock-in on the cloud when (if you squint, a lot) it should be the opposite: there are devops tools that... almost... should make you more independent.
IMO it should be a sneaky powerful declaration by major corps that your app should be built to be deployable nearly at will on at least two clouds. I mean, Terraform is so tantalizingly close to it, until it isn't. This would be like Bezos sending out the "thou shalt service everything" mandate.
AWS knows this, and they are all about lock-in. They want you on the more complicated products, because those are really hard to move off of. Oh yes, don't use Cassandra, use Dynamo. Man, you'll never move off that.
So if you tell the devs "you can develop on AWS, but you have to deploy on Hetzner," that will force them to be far more cloud-independent. I guess if I were a CIO (never let me become one) I'd try to institute that.
> I'm a big advocate of rapid prototyping as a _huge_ business lever, because the ability to try out ideas quickly, to easily reconfigure things, is the key for time to market. You can quickly see if something is going to fly or not.
And that's where the advantage ends.
Too many businesses aren’t even properly utilising that key advantage. They’re moving servers to the cloud but still using their outdated development and deployment processes, and things move just as slowly in the cloud as they used to on prem. They know what Infrastructure as Code means, but only as separate words.
> They’re moving servers to the cloud but still using their outdated development and deployment processes, and things move just as slowly in the cloud as they used to on prem.
For many non-tech corps, the purpose of moving to cloud is to downsize IT admin staff. It works well.
Your sysadmins are now Cloud Admins and can get an extra 50k in the market with GCP or AWS certs. You're going to bump up their salary, right?
the useless offshore team is now a Cloud useless offshore team, and also wants their 20% bump. And bet your ass that Tata or Cognizant will get blood from a stone to make it happen, cuz as useless as they may seem you still need them.
change control meetings haven't gone anywhere, and if anything they're more important since now your entire infrastructure is a long one-liner away from being borked; cloud is an API, basically. just because you're not racking and stacking doesn't mean the demand, architecture & design, review boards, implementations, and due diligence steps go faster.
Now we need an entirely new strategy to handle costs, since our architects and procurement can't track day-to-day cost changes easily; so when SuperDev decides he's going to #yolo 6 VMs and a few dozen containers into existence to test a few things, we now have to launch a technical and financial investigation into 1) how that happened, and 2) how much it cost.
still gotta use fortigates or palo altos, and internal networking hasn't changed too much overall; lean teams to begin with.
so in exchange for shoveling huge quantities of OpEx to companies that don't deserve more money, we don't really cut labor, and lose control of practically every other facet of our infra. Hope that Azure AD doesn't fail again, cuz the dashboard says 100% green but nothing is working and the execs are concerned.
Maybe the most demonstrable asymmetry is egress/ingress bandwidth. But
since there's a power asymmetry when dealing with mega-corporations, I
had other asymmetries in mind too.
The cloud providers are so large that they don’t really need you. It’s all about churn management at the macro level. With hardware and software, there’s always an end of quarter leverage point.
I spent most of my career in large enterprises. The leverage you have against AWS or Microsoft is 0 compared to the old days. They are probably landing more infrastructure every month than my global company had in datacenters 15 years ago.
You can just have on-premise k8s and keep most of the velocity gained from developers being able to "just run stuff" instead of everything having to go through sysadmins.
You can just rent a few servers from OVH to start and not have to worry about actual hardware, while still being a few times cheaper than cloud.
Sure, you won't have access to the slew of cloud services and will have to deploy your own database, but with the amount of readily available code and software to do it, that doesn't really slow down experimenting all that much.
Yeah the delta in cost/CPU/memory/storage between self hosting on second hand stuff and paying cloud services is insane. It's a no brainer if your use case needs beefier hardware than the typical $5/month VPS host and you don't have enterprise level liability/uptime expectations.
With stuff like proxmox you get a pretty similar level of ease of use to managed VM services too.
I generally find these discussions unproductive on HN because of how binary they become, but definitely once you go off the beaten path it seems a lot cheaper (sans power though) to run your own infra. I've been thinking of doing some data ETL for something that generates 1 GB / day and has retention for 30 days and it's much cheaper to host a Postgres myself at that scale than run it in the cloud. I could store in something like S3, but then I need to deal with querying S3. I'd like to combine some of this with cloud infra but I suspect cloud bandwidth costs would kill me. Colo-ing 2-3 servers is probably the best bet here.
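To make the workload above concrete, here's a back-of-envelope sketch in Python. The 1 GB/day and 30-day retention come from the comment; the per-GB rates are purely illustrative assumptions for comparison, not quotes from any provider.

```python
# Sizing the ETL workload described above: 1 GB/day ingest, 30-day retention.
GB_PER_DAY = 1
RETENTION_DAYS = 30

# With fixed retention, the dataset reaches a steady state: each day one
# new GB arrives and the oldest GB ages out.
steady_state_gb = GB_PER_DAY * RETENTION_DAYS

# Hypothetical monthly rates, for illustration only.
CLOUD_MANAGED_PG_PER_GB_MONTH = 0.25  # managed Postgres, storage + overhead
SELF_HOST_PER_GB_MONTH = 0.02         # amortized disk + power on own hardware

cloud_monthly = steady_state_gb * CLOUD_MANAGED_PG_PER_GB_MONTH
self_monthly = steady_state_gb * SELF_HOST_PER_GB_MONTH
```

At 30 GB steady state the dataset is tiny by Postgres standards, which is why the fixed costs (instance pricing, egress) dominate any cloud quote at this scale.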
I was an ops engineer at Fastmail. We ran our own hardware: a mix of Debian on one stack and SmartOS (Illumos, from Joyent) on another, and there were plenty of physical problems and costs. Putting petabytes of data, with replication and syncing and all that, on the cloud would have been a cluster F; on the other hand, we missed out on a lot of the awesome newer deployment and other tooling because we had written our own. Before I left we swapped out a good chunk of that snowflake software, but it was actually impressive how well it worked. Good multi-data-center multi-master MySQL support in 2008? Killer feature. Maintaining that in 2020? Horrible.
Also there were plenty of upstream routing issues where solving that became a headache. The #1 thing we wanted was uptime and the #1 outage was our upstream providers having trouble routing to other upstream providers.
The number one reason and tradeoff for cloud is uptime and availability and the cost of not having it
> The number one reason and tradeoff for cloud is uptime and availability and the cost of not having it
100%. I ran API networking teams at a Big Tech and I know the difference. My workload here is at a hobbyist+ grade, no controls, no compliance, best effort SLA. I don't want to ignore the reality that to get enterprise grade uptime and availability, it's really hard to do this on prem.
If you use it correctly, with multiple availability zones and even multiple regions, you can reach very high reliability. They surely don't offer five 9s for a single zone.
I am not aware of many multi-region outages.
And they are also getting better over time, spending a lot more engineering hours on reliability than most companies.
And if they fail to deliver on SLA they might give you money (depending on your contract I guess).
Do you really think on-prem has more uptime than commercial cloud? For non-tech companies, not even close. Commercial cloud is adding nines to uptime and saving money by removing in-house IT admin staff.
The biggest miscalculation is really on expected uptime. People say they need five nines, yet take their car in for service twice a year for a tire rotation and oil change. How much uptime is _really_ required?
I'm always curious about the power cost here. Everyone I've talked to who runs servers at home either lives in a low-COL area where electricity is cheap (and dirty, but I don't really want to get into that) or just pays for the power as a "cost of playing around," which is completely fair (it's not like I need to use my table saw at home). Most folks I know who run servers use a lot more power than I do at home. Of course, I don't know how much of that is just that folks who build expensive homelabs are attracted to beefy, power-hog servers, as opposed to an actual incentive to cut power costs.
My rack draws around 750W at mid-idle (not totally idle but not doing anything big either). I pay about $0.14/kWh for delivered electricity in southwest Ohio. This means the rack costs me approximately $77 per month, or approximately $920 per year. That's not a particularly small cost, but it is far less than I would pay to host the same things in a professional data center, or (shivers) the cloud.
If I really wanted to optimize for power efficiency, I could do much better. I've seen decent homelab setups (with NAS, router, switch, and some slow compute nodes) that run under 100W, which would cost me only $10 per month in power, and would be far more powerful than a small DigitalOcean droplet.
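The arithmetic behind those figures is simple enough to sketch in a few lines of Python; the 750 W, 100 W, and $0.14/kWh inputs are the ones from the comments above.

```python
# Electricity cost for a rack at a constant draw.
def power_cost(watts, usd_per_kwh, hours=24 * 30):
    """Cost in USD over `hours` of runtime (default: a ~30-day month)."""
    return watts / 1000 * hours * usd_per_kwh

monthly = power_cost(750, 0.14)             # the 750 W rack, per month
yearly = power_cost(750, 0.14, 24 * 365)    # same rack, per year
low_power = power_cost(100, 0.14)           # a frugal 100 W lab, per month
```

This lands at roughly $76/month and $920/year for the 750 W rack, and about $10/month for a 100 W setup, matching the numbers quoted above.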
> I've seen decent homelab setups (with NAS, router, switch, and some slow compute nodes) that run under 100W, which would cost me only $10 per month in power, and would be far more powerful than a small DigitalOcean droplet.
I do some homelabbing at home, and I work for some "big tech". The difference, essentially, is reliability and high availability.
Most homelab posts I see are one decent (not even large) disaster away from losing everything.
I do something in that space at home, mostly around data backup and replication, but I am well aware that in case of a decent disaster I'd probably be at least a couple of days offline (potentially up to one or two weeks).
Most people underestimate this facet of the discussion.
Agreed. Most people seem to take a very cavalier approach to backups, for example, or using risky Ceph/ZFS setups without understanding the consequences.
In the absolute worst-case scenario for me, barring a lightning strike that fried my UPS and entire rack, I would lose a day’s worth of changes, as my backup node kicks on daily to ingest snapshots. Downtime would probably be 15 minutes or so - boot up backup, change target IP address on other nodes to access it.
I’m only running RAIDZ1, so I’d have to lose two disks in a VDEV for this to occur. I understand and accept the risks, but were I hosting anything of import, I’d probably accept the additional power draw of keeping the backup server on 24/7 and stream snapshots to it continuously.
Also, of course, I’d be streaming those snapshots off-site. Currently I do so for things like photos and documents.
If I lost 2/3 of my compute nodes, I’d be down for a bit longer, as I’d have to shift workloads to the backup server (which is a dual socket with enough RAM to handle it), and currently it doesn’t run K8s. I can shift things to Docker Compose easily enough, or I suppose I could register it as a worker node that’s just tainted most of the time.
I'm posting this just to compare power rates not to win an internet fight. My power bill is tiered, and at the lowest tier I'm paying double your rates ($0.28 / kWh.) Once I add a rack like that on I'd probably be bumping up to the next tier which will increase my power to something like 2.5-3x that rate. Because of the tiered system even my non-server load will be metered at a higher cost. It's really not worth it here to run servers at home if I'm not being highly power conscious.
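The tiered-billing effect described above is worth spelling out, because the rack's consumption is billed at the *marginal* tier, not the average rate. Here's a small Python sketch; the tier boundary and the 2.5x top rate are made-up examples in the spirit of the comment, not any utility's actual schedule.

```python
# Tiered residential billing: each kWh beyond a tier boundary costs more.
TIERS = [
    (500, 0.28),            # first 500 kWh at $0.28/kWh (lowest tier)
    (float("inf"), 0.70),   # everything above at ~2.5x that rate
]

def bill(kwh):
    """Total bill for a month's consumption under the tier schedule."""
    total, remaining = 0.0, kwh
    for size, rate in TIERS:
        used = min(remaining, size)
        total += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return total

baseline = bill(450)          # household alone stays inside tier 1
with_rack = bill(450 + 540)   # a 750 W rack adds ~540 kWh/month
marginal = with_rack - baseline
```

The rack's ~540 kWh mostly lands in the expensive tier, so its marginal cost per kWh is far above the headline $0.28 rate, which is exactly why it's "really not worth it" there without being highly power conscious.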
Meanwhile in Germany, the tiers are the other way around: base fees and meter costs increase effective rate at the low-consumption end, while you end up with only (even discounted, somewhat, usually!) per-kWh rates dominating at high consumption.
Industrial loads also get assessed for peak power sustained for like a few minutes at least once a year, to reflect capex/depreciation of sufficiently-overprovisioned distribution transformers and other related last-mile power-handling capacity.
This is relatively negligible if you average over 10% of your peak draw, though. And even beyond that, recent energy prices may have shifted the balance toward even more-peaky consumption.
I looked into cloud, but it's just not really feasible. It really is possible to do a lot with a little at home.
I'm using about 60-100W for my home-prod, and a lot of it is "older". I'm running about 15 small VMs at any given time these days, and probably 20 containers.
I think my biggest single draw is the Mikrotik CCR1036 in the garage, but it saved me from buying new gear. Sure there's a break even point with hydro, but that's years in the future when the device is free. It's also pretty fun to watch VPN connections testing at 700Mbps from home.
I don't really care about uptime, and I've got gigabit fibre to the house, so bandwidth isn't a huge problem. It worked fine on 300/60Mb cable too.
Ryzen 3 2200G, 32GB RAM, 1T NVMe, 10TB HDD. This one runs services.
Orange Pi 5 16GB with another 500GB NVMe. This runs redundant services and monitoring.
My rack lives in an expensive energy market, and pulls just under 500W. Almost 40% of that is a Dell R430, which by itself costs about $50/mo to run.
Next time I have the energy (hah) for this flavor of home maintenance, the idea is to split the work it does off to a few fanless systems; I'm pretty sure I can knock at least 100W off that. The main challenge there is storage - I have a SAS shelf and need a low-wattage machine that speaks SFF-8088.
Experimented some with a couple Raspberry Pis for some things, but they just don't seem built to run 24/7. One lasted about 4 months, the other died at about 12. (They PXEbooted, no local storage, it wasn't that.)
I have a Plex server with 6 spinning disks and a GPU which costs me around $20/month. The most expensive part is when the GPU is running (based on my Kill-A-Watt.) I host around 15 Docker services on it.
I think even with the cost of electricity, you can easily beat cloud hosting on a per-month basis. But, factoring in the initial cost of hardware and electricity, it's probably a wash.
But then, if you're running a hypervisor, and would otherwise have a LOT of VMs in the cloud, maybe it swings back the other way?
My quarter rack Home lab in my basement is pulling 133 Watts right now with a low power 1U Supermicro SOC, an older Supermicro mini tower used as a NAS, and my networking gear and Internet modems. I don't have any intense workloads running at home. I was really power conscious when buying my two home servers as I know it is very easy to buy a beefy server off of Ebay that sounds like a jet engine 24/7 and pay out the nose on my home power bill.
My rack is about 500W, up to about 700W when the backup server kicks on (lots of spinning disks).
That’s a UniFi Dream Machine Pro, UniFi 24 port switch (powering two APs), 3x Dell R620s with a few SSDs and NVMe, and 2x Supermicros (one of which is the aforementioned backup server), each with a lot of spinners. Also some additional load from the overhead of the rack UPS.
I pay about $0.08/kWh, although with the base fee of $40 it's more like $0.11/kWh. In any case, it means I pay maybe an extra $30-40/month for my homelab, plus whatever additional heat load costs it places on my A/C.
If I moved to somewhere where electricity was significantly pricier, I would probably either invest in home solar, compress compute to a single node, or both.
> Sit back and relax? Being massively overprovisioned is a benefit of homelabs.
there are many dimensions to provisioning, not all of them are one ebay/amazon/newegg purchase away.
you could hardly get a symmetrical 10 gbps internet connection at home (in most places), and if you do it would be unlikely to be timely (and in that case, your business could probably be suffering).
Frankly, i think that the time when your startup is taking off might be the right time to start thinking about moving to the cloud (or to a proper datacenter).
If anything, if your startup is taking off then you're starting to get a real sense of what kind of compute and storage you actually need, and can maybe negotiate accordingly (e.g. long-term commitments for resources on some clouds give you very relevant discounts).
EDIT: regarding the internet connection... on a consumer connection, most contracts include a minimal guaranteed bandwidth that's usually way lower than the advertised peak bandwidth. i wouldn't be surprised to discover people getting throttled at those speeds if they start getting serious traffic...
The shitty code most places are running has such horrific latency, between awful SQL queries and choices like Node as a backend, that the difference between a 1G and 10G uplink is unlikely to matter much, especially if you're caching static content with a CDN.
This does presuppose you have a business class internet connection, of course.
> you could hardly get a symmetrical 10 gbps internet connection at home (in most places), and if you do it would be unlikely to be timely (and in that case, your business could probably be suffering).
if your startup takes off and you can't get a professional connection on time... it might just drive users away. particularly paying users.
and depending on where you are, that might not even be possible at all.
Depends on what "taking off" means for the startup, too. Taking off at mass consumer scale might need that flexibility; taking off, or even reaching market saturation, in a specific B2B niche might be achievable on Raspberry Pi-level hardware. There are many more of the latter.
Agreed - 1gbps is a surprising amount of bandwidth, you could easily host a fairly popular mobile app, saas, etc with plenty of breathing room. And in a lot of cases, you can just move your static file hosting behind Cloudflare, or onto something like s3, and give yourself even more room to grow.
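To put a rough number on "a surprising amount of bandwidth," here's a quick capacity estimate in Python. The response size and utilization target are assumptions for illustration; real traffic varies wildly.

```python
# How many API requests/second can a 1 Gbps uplink plausibly sustain?
LINK_BPS = 1_000_000_000        # 1 Gbps symmetric uplink
AVG_RESPONSE_BYTES = 50 * 1024  # assume ~50 KiB per response (API + assets)
UTILIZATION = 0.5               # leave half the link as headroom

requests_per_sec = LINK_BPS * UTILIZATION / (AVG_RESPONSE_BYTES * 8)
```

That works out to over a thousand requests per second even at 50% utilization, and far more once static assets move behind a CDN or S3, which is what gives the "plenty of breathing room."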
Any business that you can run from a home server with a residential business line is not the kind of business we are talking about here. Yes, you can potentially serve a lot of customers with that setup, but your reliability story is terrible so you better have very forgiving customers.
What if your internet goes out? Even with a business line, I've had to wait five days for them to replace a fiber line that a squirrel chewed through.
What if the power goes out? I just had a five hour power outage. Even if you have a battery backup, when the neighborhood power is out for a while, the ISP equipment will die when its batteries go out.
What if your hardware dies and you aren't home to switch it out, assuming you even have spare hardware?
What if your A/C goes out and your server overheats and has to get shut down?
All of these are things you usually don't have to deal with when using the cloud or even a $5 VPS, because they design for all of these failure cases from the start.
If you're running a business from your house, it is by definition a lifestyle business, and that's not really what we are talking about here.
> Any business that you can run from a home server with a residential business line is not the kind of business we are talking about here.
What kind of business are we talking about here? What does the "taking off" in your previous post mean, exactly?
Depends on what you're trying to do.
You are not going to be able to run a Netflix competitor out of your garage.
You're not going to get high availability without some significant investment and even then you'll be at the mercy of whatever your ISP is doing upstream in the event of a power outage. I live in an area where we average something like 99.99% power uptime, but not everybody is so lucky.
You could, potentially, host something that serves up something non bandwidth-intensive to tens of thousands of users, give or take an order of magnitude. (SaaS, APIs, etc) You can do a lot of interesting things with a homelab and some of them are potentially profitable.
Perhaps more crucially: you're not exactly locked in to a homelab. You can start with that and once you reach a certain point, migrate to colo or cloud.
The only realistic concern for me here is the ISP failure. Even then, if I really wanted to I could have both AT&T and Spectrum uplinks with an LACP bond.
I have enough battery backup to run my rack for about 30 minutes, more if I shut down a node. That’s more than enough time for me to set up my generator and route power; it has an extended gas tank and can power the rack, fridges/freezers, fans, etc. for over 24 hours. I periodically run drills on this to ensure it’s doable, and that the gear works as expected.
If I’m not home, then yes, the latter would fail. Dual hardware failure is an unlikely scenario; single node failure is handled between K8s and Proxmox clustering.
That would be a dream come true! Everything is automated with Infrastructure as Code tools like Terraform, Pulumi, Dagger, etc. You can easily point at another K8s cluster, redeploy there, then update the domain's DNS records to divert traffic.
Additionally, if the company starts to grow you can hire other people who are familiar with the cloud provider (and it has extensive documentation). The last company I worked at that had stuff running on a rack in the office just had a "Keep Calm and Sudo On" printout taped to the cage and the guy who'd set it all up had quit.
The app still needs to be written to be scalable. And if shit hits the fan, moving to cloud (...or just renting dedicated servers at 1/3 the cloud cost) isn't too bad.
This is precisely what I plan on doing when I eventually launch my app. I have 3x Dell R620s in a Proxmox cluster running K8s nodes, Ceph on NVMe with CSI-RBD, and a separate Supermicro 2U exposing ZFS for spinning storage. The only missing link is 10G networking, because 10G switches aren’t super cheap.
This is the most undersung hero. Internet speeds are reaching the equivalent of a datacenter's. This will open up so many possibilities that it makes it feel like the 2000s again. And finally we might actually need IPv6.
I just moved out of the consumer internet dystopia that is California (or most of US I guess) to Spain and am just about to get 10 Gbit symmetric for €25/mo. Even if I get half of that I’m ecstatic. This kind of infrastructure is so conducive to all kinds of interesting and “decentralized” innovations.
Yes indeed. My Bell.ca ~1 Gb fiber costs $100 a month, plus $20 for a dedicated IP. Since it's business-line fiber it comes with monitored service quality, meaning it's prioritized over regular/residential fiber traffic (claimed!).
I basically wrote the same thing a few months ago. TL;DR: for most systems, public cloud is neither cost effective nor secure. Another interesting Apple exploit using Spectre on Safari is on the front page today. Most systems are better off self-hosted, with cloud tools to manage them: https://open.substack.com/pub/rakkhi/p/why-you-should-move-b...
In my case Xfinity kept the same IP for me for two years, then an outage happened where everyone in the neighborhood lost connectivity. When connectivity was restored I got a different IP.
I feel like the biggest difference is the fact that there's no guarantee that the dynamic IP won't change, so all systems need to be prepared for that, or you need to be mentally prepared for that day.
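One way to "be prepared for that day" is a small watchdog that polls your public IP and updates a DNS record only when it changes. A minimal Python sketch follows; `update_dns` is a stub callback (swap in your DNS provider's real API), and the IP-echo URL is an assumption, not a recommendation.

```python
# Dynamic-IP watchdog: detect address changes and push a DNS update.
import urllib.request

def get_public_ip(url="https://ifconfig.me/ip"):
    """Ask an external echo service what our public address looks like."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode().strip()

def sync_dns(last_ip, current_ip, update_dns):
    """Call update_dns only when the address actually changed.

    Returns the current IP so the caller can persist it as new state.
    """
    if current_ip != last_ip:
        update_dns(current_ip)
    return current_ip
```

Run it from cron every few minutes, persisting the last-seen IP to a file; the design keeps DNS writes idempotent so a provider outage or flapping link doesn't spam your registrar's API.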
Renting a dedicated server from a company like OVH, or even co-locating your own, is also far cheaper than cloud, a few times over, without the fuss of turning your house into a datacenter.
This post, along with this Simon Wardley thread, https://twitter.com/swardley/status/1088780650860158978?s=20, attracted me to finops as an engineering problem. Consumption based pricing is really tough for a lot of orgs to handle at the $1M+ range, so much so that it requires dedicated software to line things up.
I started helping maintain https://ec2instances.info in my spare time and the code base is literally full of IF statements to paper over AWS billing quirks. Later, I joined Vantage, one of the companies linked in the article.
It's a little bit undecided whether finops will have the growth that devops did but the problem is seemingly felt very acutely.
> attracted me to finops as an engineering problem. Consumption based pricing is really tough for a lot of orgs to handle at the $1M+ range, so much so that it requires dedicated software to line things up.
absolutely. massive struggle at the orgs I've been / am at.
processes for provisioning new resources haven't changed, and a lot of teams kind of push their own builds, or get reservations for X amount of space and dollars and then they can play with their space.
the result is OpEx shifts wildly from one quarter to the next, and control has been hit or miss. it continues mostly because projects need to deliver -- and do -- but that "do what you gotta do to make it work" approach has turned basically into the wild west and no one plays by the rules.
Yeah. In an infrastructure as code setup there's some more rigor that can be applied, at least in terms of tagging what's going out. But I've generally seen the finops team come in and optimize ex post facto.
Back in 2007ish, the CEO of Parallels/SWSoft (the company behind Virtuozzo, a very popular hypervisor back in the day) stood on the keynote stage at HostingCon and said that the Cloud (AWS) was going to eat the web hosting industry's lunch. His entire keynote was very doom and gloom. Fast forward to 2015ish: it totally did. The days of VPS and dedicated hosting being the only options for deploying your stuff for the Internet to consume were killed by the big three cloud providers.
On one hand, I am very thankful that this caused a boom of innovation and tooling that made engineering operations easier and more accessible to our industry as a whole. If I'd had the tools I have now, managing 50+ racks in my old datacenter would have been so much easier.
But on the other hand, the promises the cloud providers have shoved down our throats have been used against our industry: raised prices, reduced freedom in which providers you can choose, and consolidation of duties onto engineering staff. The average developer has to do more work, across many different domains, to get paid well. The war waged to kill off systems administration as a core job responsibility, and the use of the DevOps movement to cut FTE count for specialized bare-metal and datacenter work, keep us locked into using Cloud APIs to do our jobs. I think this really sucks.
I do think this article hits home on some points I have been seeing over the past few years. The costs are too high for the goods provided in comparison to getting a couple of racks in few datacenters and DIY which can for sure work for the right kinds of companies at the right stage of growth. I also think Hybrid deployments of Cloud + Bare Metal in a datacenter is so much more of a viable option for less mature companies. The pendulum is starting to slowly swing back to doing things on Bare Metal.
I am excited for the next 10 years of progress and I hope the Cloud still remains a powerful tool in our toolbox, but it isn't used for everything. The cost savings are too real as this article points out when you don't use the cloud and you'd be surprised how much easier it is these days to get some bare metal deployments going in your eng org.
> keep us locked into using Cloud APIs to do our jobs
I love that we can use Cloud APIs to do our jobs. That's piles better than negotiating with (and waiting for) an Ops team to provision you hardware or networking that you don't have much visibility into and then relying on them to troubleshoot.
I don't mind paying more for that (which you absolutely do on a per-unit of compute, network, or storage).
Don't get me wrong, I love using them too. They are great for getting something setup quickly when you are getting started. I use some cloud services that support my single datacenter rack I have today. For my expensive workloads, I run them off three beefy bare metal servers in my rack.
I'd argue you have less visibility on what is going on at a cloud provider than you did in the past with your Ops team. Support tickets without a massive yearly spend at a cloud provider are pain and suffering. Even with over 1 million/year spend at one of my previous companies, support and visibility on underlying issues was a very poor experience.
From the developer angle? I'm not so sure. It's an N of 1, but I ran the Ops team for an e-commerce company that you're 50/50 likely to have heard of. I'm proud of how skilled and hard-working my e-comm systems/storage/network/DBA/DC-ops team of around 20 was, and the overall very good operational results we delivered.
Even with that, AWS's out of the box offering utterly dominates what we could offer as a medium-sized operation. You might have 5 arbitrary points' worth less visibility behind the scenes, but you need 15 fewer points of visibility behind the scenes. (We're also on enterprise support and have an excellent support & solution architecture team attached to our account, but even for my side projects where I have no support plan whatsoever, I have enough visibility into how to make things work.)
Look - if you can optimise your workload over a capital cycle cloud is not for you. Especially if you are a technology company and have the bandwidth to deal with servers and ISV's properly.
If you are a changing, growing, non-technology company then cloud is the place you should be. You will get screwed by Oracle, you will get screwed by Accenture. You will not be able to optimise your workload and you may as well get screwed by aws/azure/gcp and not think about infrastructure that much instead.
Good luck getting the time of day from Oracle or Accenture if you are not already quite a large established company. Not that you want it anyway though.
I think it's a bit nutty that folks always use Dropbox as the example of "cost savings by doing it yourself".
I mean, Dropbox itself is essentially a provider of cloud storage. It seems like owning their own data centers was obvious from the beginning: Use AWS initially to scale, but given the size of Dropbox it obviously doesn't make sense to have another middleman in the mix.
I just point this out because Dropbox is so unlike your average enterprise user of the cloud. I'd like to see some better examples than Dropbox of enterprises saving more (or innovating as quickly) by going on their own.
I work at a company that recently tried to shift our (on prem, sharded, 2-metro, 6 replica) mysql deployment to AWS. Any way we sliced it, even after steep discounts, we were seeing massive TCO increases, often up to 10x - the math just doesn't work out for large deployments. It's really a shame that all these companies are forced to write their own (crappy) version of AWS - but price is a feature when it comes to this stuff.
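For anyone wanting to sanity-check this kind of comparison, a back-of-the-envelope 3-year TCO sketch might look like the following. Every figure here (server capex, colo rates, salaries, instance pricing, discount) is an invented placeholder to show the structure of the comparison, not a number from the deployment described above:

```python
# Rough 3-year TCO comparison: on-prem database fleet vs. managed cloud.
# All dollar figures are assumed placeholders -- substitute your own quotes.

YEARS = 3

def onprem_tco(servers, server_capex, colo_per_server_month, eng_salaries):
    """Capex amortized over the period, plus colo rent and staff."""
    capex = servers * server_capex
    colo = servers * colo_per_server_month * 12 * YEARS
    staff = eng_salaries * YEARS
    return capex + colo + staff

def cloud_tco(instances, instance_month, discount):
    """Instance-hours at list price, less a negotiated discount."""
    return instances * instance_month * 12 * YEARS * (1 - discount)

onprem = onprem_tco(servers=12, server_capex=25_000,
                    colo_per_server_month=400, eng_salaries=350_000)
cloud = cloud_tco(instances=12, instance_month=8_000, discount=0.30)
print(f"on-prem ${onprem:,.0f} vs cloud ${cloud:,.0f} ({cloud / onprem:.1f}x)")
```

With these placeholder numbers the gap is smaller than the 10x cited above; for large, storage-heavy, always-on deployments the cloud multiplier grows quickly.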
Martin Casado is one of the smartest people I've met in my career in IT (AWS, VMware, etc).
This phrase is key:
"You’re crazy if you don’t start in the cloud; you’re crazy if you stay on it"
Perhaps I'd say that "you're crazy if you keep all your IT on it, as you scale".
There's still a case to be made for a small % of your workloads in the cloud. The flexibility is priceless. But for everything else, there's a well managed on-prem.
Side note: interesting how, after Marc Andreessen's disastrous post on techno-optimism a few days ago, I approached the reading of this article with rolled eyes and low expectations. Seeing Martin Casado as the author immediately changed that, however.
> "You’re crazy if you don't start in the cloud; you're crazy if you stay on it"
Spot on. Gonna have to steal that for my next meeting.
Although, given the potential Cavium-Marvell TPM issue viz ECC and
that we're not really there with homomorphic compute, I might make a
caveat for developers working on very sensitive new ideas; Keep it
on-prem until you have a good enough security segmentation to push out
what you're happy with on untrusted cloud infra.
> Perhaps I'd say that "you're crazy if you keep all your IT on it, as you scale".
And by scale, I think companies should be at least in the double-digit millions spent on cloud infrastructure before they start building out their own. Everybody is used to the cloud APIs, and the old days of managing things in a static datacenter are not coming back for these companies.
3 years ain't what it used to be - A100s were launched over 3 years ago and lots of people are still happy when they can get hold of some on AWS. People buying H100s now will very likely still be getting good value out of them in 2027.
> But for everything else, there's a well managed on-prem.
That’s begging a number of questions, however - the number of places who can approach a cloud service portfolio at all, much less with a cost savings or security parity, has to be at least an order of magnitude smaller than the number of places who incorrectly thought they could.
It does beg the question, but your point about orgs "incorrectly" thinking they could move some parts of their portfolio off-cloud begs several more. I would like to see it fleshed out a bit, into some high level rules of thumb, to guide, e.g.
IaaS Cloud for:
Flexibility; very spikey compute; modest to normal egress; low to moderate interruptible AI workloads (use spot); modest to normal storage, large if properly tiered and tradeoffs ok; lower effort HA; client reqs (but make them pay for it); security model (non-portable); Integrated turnkey DBaaS;
Iff you have FinOps, careful scale-in policies for DevOps, or are stuck non-portable
Off-cloud (CDN, Hyperscale compute, *aaS, on-prem)
Fit on a $5 VPS; Have simpler neat 12 factor PaaSy container workloads (Fly/Vercel); Well-defined heavy workloads; heavy egress (use CDN); heavy non-interruptible AI/specialist compute; large storage (CDN for objects) or compute-local with high IO (colo, onprem); HA with more effort; Specialist DB or boutique DBaaS (Supabase); Portable security (k8s)
Iff you have invested in portability (containers, k8s, FaaS frameworks, interruptible workloads, high level APIs), HA testing and appropriate security
I've always thought the mysterious bit was that cloud never seems to get cheaper, while the cost difference between cloud and your own infra, and the profit associated with AWS, suggests it could. Is there something structurally wrong with competition in cloud?
* Egress pricing margins ensure lock-in, which makes it hard to build competitors to “commodity” services (e.g. you can’t spin up your own price-competitive S3 within AWS). Lack of competition means less innovation and more expensive pricing.
* While compute costs don’t necessarily come down, the CPUs get more powerful. At scale, this should be the same as prices coming down. However, it’s not exact, and cloud providers pocket this difference as profit / R&D investment.
* SRE costs are a huge chunk of own infra (managing servers at scale). If you’re small, this is a negligible cost. If you’re a large business this is a huge cost. Cloud providers target large businesses so the savings when you’re smaller are less obvious.
* Elasticity is a huge part of cloud capabilities. Most people use dynamic paygo pricing, which is more expensive than baseload demand; cloud providers typically discount baseload because those revenues can be used to purchase additional capacity.
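The baseload point in that last bullet can be made concrete with a blended-cost calculation: commit to reserved capacity for the steady demand and pay on-demand rates only for the spikes. The two hourly rates below are assumed placeholders, not real provider pricing:

```python
# Blended monthly cost: reserved pricing covers baseload, on-demand covers bursts.
# Both hourly rates are illustrative assumptions, not real cloud prices.
ON_DEMAND = 0.10   # $/vCPU-hour, assumed
RESERVED = 0.06    # $/vCPU-hour with a long-term commitment, assumed

def blended_monthly_cost(baseload_vcpus, peak_vcpus, peak_hours, hours_in_month=730):
    base = baseload_vcpus * hours_in_month * RESERVED
    burst = (peak_vcpus - baseload_vcpus) * peak_hours * ON_DEMAND
    return base + burst

# 100 vCPUs steady, bursting to 300 vCPUs for ~80 hours a month:
all_on_demand = 100 * 730 * ON_DEMAND + 200 * 80 * ON_DEMAND
blended = blended_monthly_cost(100, 300, 80)
print(f"all on-demand ${all_on_demand:,.0f}/mo vs blended ${blended:,.0f}/mo")
```

With these assumed rates, reserving the baseload cuts the bill by roughly a third while keeping the elasticity for the spikes.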
That's not correct. Cloud costs have steadily declined since ~forever, due both to the inevitable march of performance and to inflation. It used to be that 1 Ivy Bridge vCPU cost $71/month in 2014 dollars, or $90 today. But today you can get a machine of the same size class, with a dramatically faster CPU, for $30/month.
- an Intel 3770K was released in 2012 for $330 retail, or 20 CPUMark/2012$.
- an AMD 7800X3D was released in 2023 for $450 retail ($330 in 2012 dollars), or 105 CPUMark/2012$ or 77 CPUMark/2023$.
Looks like consumer performance/price has improved by 5.3x.
I'm not sure exactly how the math works out here for cloud hardware, since I don't really have any performance numbers. I'd still expect quite a bit less improvement, since it "feels" like server CPUs have gotten better at cramming more cores on a die rather than increasing each core's performance. Looking at the Xeon Platinum 8480+ (2023, MSRP $10k), there are 56 cores @ 3154 ST CPU Mark, vs. the Xeon E7-8890 v2 (2014, MSRP $6.8k) with 15 cores @ 2175 ST CPU Mark.
Assuming your cloud prices are correct, and with these rough performance figures, you're getting a 4.5x cloud performance/price improvement, significantly worse than with real hardware.
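The arithmetic above can be checked in a few lines. The CPU Mark scores, MSRPs, and the ~1.36 inflation factor from 2012 to 2023 dollars are the approximate figures assumed in this thread, so the results land near (not exactly on) the 5.3x and 4.5x quoted:

```python
# Performance-per-inflation-adjusted-dollar, consumer vs. cloud.
# All scores, prices, and the inflation factor are approximate assumptions.
INFLATION_2012_TO_2023 = 1.36  # assumed CPI factor

# Consumer: i7-3770K (2012, $330, ~6,600 CPU Mark) vs 7800X3D ($450, ~34,500)
old_perf_per_dollar = 6_600 / 330                              # ~20 CPUMark/2012$
new_perf_per_dollar = 34_500 / (450 / INFLATION_2012_TO_2023)  # ~105 CPUMark/2012$
consumer_gain = new_perf_per_dollar / old_perf_per_dollar      # ~5.2x

# Cloud: price dropped ~$90 -> ~$30 (today's dollars) for a vCPU while
# single-thread CPU Mark went 2175 -> 3154:
cloud_gain = (90 / 30) * (3154 / 2175)                         # ~4.4x

print(f"consumer {consumer_gain:.1f}x vs cloud {cloud_gain:.1f}x")
```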
The capex of the computer is only about a quarter of the TCO over a 3-year amortization. I don't think it is as relevant as you seem to believe. However, I have definitely seen some naive analyses along these lines written by people with a predisposition for moving out of clouds.
Good luck trying to write down that number. The cloud has list prices. Colo facilities can and will demand 500% rent increases when it's time to renew. You can't predict it.
AWS was first to market for many years, and their margins were always higher than ecommerce retail (and that's easy to believe, since retail often runs on 1% margins)
> A billion dollar private software company told us that their public cloud spend amounted to 81% of COR, and that “cloud spend ranging from 75 to 80% of cost of revenue was common among software companies”
I find that quite hard to believe.
I mean, I can see that if you were selling cloud backup services and stored customer data on S3, I can understand S3 costs being a big part of your budget.
But for the vast majority of businesses - I'd expect a supermarket selling a $50 basket of groceries to spend maybe $0.01 on database storage and CPU and whatnot.
I don't see that a company like Slack, which has revenue of about $8.75/user/month would need to spend anything like that much on cloud costs.
I've worked at a few SaaS startups that did not price their service correctly and did not want to pay up front for reserved instances. The cloud costs were very high. 80% of revenue was about right.
A chart in the article says that Slack spends an estimated 41% of revenue on cloud costs, calculated from their financials by the author, who claims to be conservative; I don't know how to estimate such a thing myself.
This is high, unless you are a platform-as-a-service, effectively selling the compute on to customers. Add in the free-tier subsidy (for growth!) and it makes sense then.
If you are selling just software and 80% of your cost of revenue is cloud, you are undercharging or wasting crazy resources.
It cuts the other way, too. A cost for one is revenue for another: In this case, AWS or Microsoft. Your cost of revenue is their revenue!
There will be big growth in the "make it easier to manage on-prem deployments" sector. Obviously you need capable sysadmins if you are going to move workloads away from cloud, but improving the tech (I'm thinking of Oxide Computer here) will make it palatable for smaller and smaller orgs to consider repatriation, as they can do so with less effort and expertise.
This makes me hopeful as well that internet infrastructure will become less centralized. It's closer to the spirit of the internet that not every bit of information you see online passes through us-east-1.
It's my hardcore held belief that the cloud industry could wildly be disrupted if it was easy enough to order 5 year capacity contracts. Then AWS / GCP etc would be left with just the elastic demand, rather than the inelastic profit center.
However - this would take a huge amount of upfront investment, have high risks for black swan events against 5y contracts without crypto-style escrow, and I have been unsuccessful pitching this direction in elevators to folks, which makes me think I don't have a visceral enough idea.
Lambda went from bragging about short contracts [0] to 3+ years and if you look closely, the only thing with actual pricing is 3 years, otherwise it is contact sales and waitlist.
5 years is a really long time on most types of B2B contracts. Especially ones with very thorough operational considerations bound to technology changing over time.
Consider how much premiums would be on $TSLA put options that expire in 2028.
Sub-linear scaling is powerful. Think about how costs scale in relation to use of a service.
If you are paying someone else for the service as a package, the costs scale linearly, or some close approximation of that. AWS gives you a small discount if you have larger volumes for some services, but these are essentially a rounding error.
If instead you launch a similar service in house, your costs are often dramatically lower per unit the more units you use. You start with a small team supporting a system with some large capacity you're not using yet because you need room to grow, and you realize large steps along the way where you can handle tens or hundreds of times the volume you see at each staffing level. Buying hardware or developing custom software gets cheaper at volume. This is sub-linear scaling.
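As a toy illustration of the step-function economics described above (every dollar figure and capacity tier here is invented for the sketch):

```python
# Toy cost model contrasting linear pay-per-unit cloud pricing with
# step-function in-house costs. All figures are illustrative assumptions.

def cloud_cost(units):
    """Linear: every unit has the same marginal cost."""
    return units * 0.10  # assumed $0.10 per unit-month

def inhouse_cost(units):
    """Step function: each tier (staff + hardware) covers a wide range
    of volume before the next step up is needed."""
    tiers = [  # (capacity in units, assumed monthly cost of that tier)
        (1_000_000, 30_000),
        (100_000_000, 60_000),
        (10_000_000_000, 150_000),
    ]
    for capacity, monthly_cost in tiers:
        if units <= capacity:
            return monthly_cost
    raise ValueError("beyond modeled capacity")

for units in (100_000, 10_000_000, 1_000_000_000):
    print(f"{units:>13,} units: cloud ${cloud_cost(units):,.0f} "
          f"vs in-house ${inhouse_cost(units):,.0f}")
```

At low volume the fixed in-house tier costs more than the linear cloud bill; past the crossover, in-house cost per unit keeps falling while the cloud bill keeps scaling linearly.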
Solving a paradox sometimes requires thinking outside the box.
Repatriation is not the global solution; it applies to mature and large companies.
The dimension to explore is co-ownership of infrastructure by multiple independent companies, both small and large.
A captive entity - wholly owned by member companies - manages a vast pool of resources, achieving both the economies of scale and the flexibility needed by growing companies.
You don't "rent" somebody else's Linux server; you invest and own a share.
R&D for evolving the platform is outsourced to third parties (maybe even current cloud providers).
If moving off the cloud was easy we’d all be doing it. For our org the risk is (unfortunately imo) bigger than the savings. I would love to pay hetzner dedicated boxes rates for compute but the amount of work to migrate everything would be huge.
Not to mention the difficulty of hiring people who are actually comfortable doing the infra work of setting up all our services, with the required multi-AZ config, on bare metal. It's hard enough finding people who can reliably set up services in AWS with Terraform and keep them alive.
Any discussion of cloud vs. data center vs. on premises that does not discuss employee costs, documentation, and training is not complete.
One of the main reasons for using cloud infrastructure is that many people know how to use it, internal documentation requirements are dramatically lessened, and you don't need to hire infrastructure IT employees to maintain and monitor it.
If cloud only costs double internal IT, it's almost certainly cheaper until you reach a billion dollar valuation - or more depending on what your infra needs are.
Dropbox has the pay, clout, and pipeline to have zero problems with recruiting talent.
Can the same be said of all F500 companies that may want to basically engineer their own cloud? Because the dev team is almost certainly used to all the convenience of the cloud, so you'll need a good infra team.
The benefit of the cloud is the elasticity.
People neglect sometimes that businesses had to scale their infrastructure to peak demand before the cloud.
If your workloads don't have peaks it's an easy problem to solve.
We had to purchase/lease HP storage servers with 10x the storage we needed for our photo startup in 2006. The lead time to get a server was weeks so there was very little elasticity to respond to non linear growth.
Would have been a lot easier if we had S3 (launched around this time, no one really knew about it or what it was).
What about a good way to judge the cost of the cloud? Pricing has always been overly complex and arcane, and full of gaps when you try to understand the overhead involved.
Honestly the relative clarity of what cloud costs has been one of the advantages that companies want to pay a premium for, compared to the quagmire of capex, depreciation, attribution of datacenter costs, and interdepartmental chargebacks that on prem usually means.
I suspect fewer cloud migrations would occur if companies could map their on-prem costs to specific products much more easily, rather than going "Oh, it might be a lot for this team, but who knows?", which opens the door for someone to sell the pricing-transparency use case to an exec.
Now you can operate your own local cloud (Dokku, Rancher, Proxmox, VMWare, MinIO, k8s, you name it). This gives you all the simplicity of deployment and accountability, and some elasticity.
It adds a bit of ops complexity at the start though. But running your own rack starts to make sense when your AWS bill is at least the same order of magnitude as an SRE salary; at that point, you already have needs that are varied enough.
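The "same order of magnitude as an SRE salary" rule of thumb can be written down directly. The salary figure is an assumed placeholder, and the half-decade threshold is one reasonable reading of "same order of magnitude":

```python
import math

SRE_SALARY = 180_000  # assumed annual fully-loaded cost of one SRE

def worth_considering_own_rack(annual_cloud_bill):
    """True once the cloud bill is within about half an order of
    magnitude of one SRE salary, or above it."""
    return math.log10(annual_cloud_bill) >= math.log10(SRE_SALARY) - 0.5

print(worth_considering_own_rack(10_000))    # small bill: stay on cloud -> False
print(worth_considering_own_rack(150_000))   # comparable to a salary -> True
```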
I'd say for a lot of companies the Fed wake-up call has happened, and they have been doing a lot to reduce cloud costs. They did not necessarily move away from cloud, but instead signed long-term contracts and removed dead or undead services.
Here we go: the biggest success of AWS is making people believe there's no alternative.
- There are a lot of companies that operate their own computing infrastructure nowadays.
- There is a lot of space in the middle; you are not forced to choose between full self-host and full cloud: there is colocation, hosting, and providers like Hetzner, DigitalOcean, and so on.
I think the biggest thing is that the minimum extra spend if you want in house managed infra is basically the salary of a full time sysadmin + hardware costs.
You can pay for a lot of cloud resources for that much, even when the cloud resources are massively overpriced compared to what you could manage if you had more scale.
The only other alternative is finding someone who's good enough at it to make that full time sysadmin job just a small part of their "actual" job responsibilities, but that person is pretty hard to find.
> I think the biggest thing is that the minimum extra spend if you want in house managed infra is basically the salary of a full time sysadmin + hardware costs.
The cloud promises to make admin tasks easier, however I have never seen it eliminate the role of a sysadmin in practice. In my experience, most organizations that run their infra on a cloud still have dedicated admin roles (often called "DevOps" or infra teams).
Hence, I think that the claim that a sysadmin's salary is an extra expense for self-managed infrastructure is exaggerated. You may need more sysadmins for achieving the same features in a self-managed setup compared to the cloud, but it is not a linear scale.
The other thing is remembering it’s not one full time sysadmin but many and also things like security staffing. One dude running some Linux boxes isn’t going to give you the kinds of audit logging, monitoring, encryption, etc. which is table stakes in the cloud world. You can hire those people, buy HSMs, etc. but then you’re talking a much bigger gap before you break even.
Then you get a step function cost increase building and operating all of the services your developers aren’t getting out of the box.
Probably, yes? Operating it to their specific needs is likely to be more efficient than operating to the generalized needs of cloud customers.
IMHO, the biggest wins for cloud are when you can't fill up a whole machine and when your needs vary significantly throughout the day and you can scale up and down quickly and not have to pay for unused capacity.
> this makes no sense: can individual companies operate computing infrastructure more efficiently than cloud providers?
As long as you know your predicted growth, yes. Cloud providers are operating "efficiently" because they oversubscribe a lot of services. You don't really think that those 8 vCPUs are dedicated to your VM only, do you?
Also, the thing with cloud are the managed services that make it easy for developers to process/exchange data, like Amazon SQS or cloud functions and things like this.
Like the first comment says, if you know what you are doing, it is quite easy actually to host your infrastructure on-prem.
What do you mean? I did the AWS negotiation for a small but rapidly growing startup and that was one of the negotiables, there was no real problem getting discounts.