This is more just "missed optimization opportunities in EC2" than a statement about mistakes in AWS as a whole.
If you want to talk systemic AWS mistakes you can make, we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours. You can accidentally create this issue across lots of different AWS services if you don't verify you haven't created any loops between resources and don't configure scaling limitations where available. "Infinite" scaling is great until you do it when you didn't mean to.
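For what it's worth, one of the few hard stops available here is reserved concurrency: clamping it on a suspect function caps how far a loop like that can run away. A minimal sketch with the AWS CLI, assuming a hypothetical function name:

# Cap the function at 10 concurrent executions ("my-webhook-handler" is a placeholder)
aws lambda put-function-concurrency \
    --function-name my-webhook-handler \
    --reserved-concurrent-executions 10

# Setting it to 0 blocks invocations entirely, a crude kill switch during an incident
aws lambda put-function-concurrency \
    --function-name my-webhook-handler \
    --reserved-concurrent-executions 0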
That being said, I think AWS (can't speak for other big providers) does offer a lot of value compared to bare-metal and self-hosting. Their paradigms for things like VPCs, load balancing, and permissions management are something you end up recreating in almost every project anyways, so might as well railroad that configuration process. I've experienced how painful companies that tried to run their own infrastructure made things like DB backups and upgrades, so much so that it would be hard to give up a managed DB service like RDS for anything other than a personal project.
After so many years using AWS at work, I'd never consider anything besides Fargate or Lambda for compute solutions, except maybe Batch if you can't fit scheduled processes into Lambda's time/resource limitations. If you're just going to run VMs on EC2, you're better off with other providers that focus on simple VM hosting.
> we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours
May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?
Btw, this reminds me a lot of one of my own early career screw-ups, where I had a batch job uploading images that was set up with unlimited retries. It failed halfway through, and the unlimited retries caused it to upload the same three images 100,000 times each. We emailed Cloudinary, the image CDN we were using, and they graciously forgave the costs we had incurred for my mistake.
> May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?
AWS support caught it before we did, so they did something on their end to throttle the Lambda invocations. We asked for billing forgiveness from them; last I heard that negotiation was still ongoing over a year after it occurred.
Part of the problem was that we had temporarily disabled our billing alarms at the time for some reason, which caused our team to miss the spike. We've since enabled alerts on both billing and Lambda invocation counts to see if either goes outside of normal thresholds. It still doesn't hard-stop this from occurring again, but we at least get proactively notified before it gets as bad as it did. I don't think we've ever found a solution that cuts off resource usage if something like this is detected.
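In case it helps anyone, the invocation-count alert can be a plain CloudWatch alarm; a rough sketch (the alarm name, function name, threshold, and SNS topic ARN are all placeholders you'd tune to your own baseline):

# Alarm if a single function exceeds 100,000 invocations in 5 minutes
aws cloudwatch put-metric-alarm \
    --alarm-name lambda-invocation-spike \
    --namespace AWS/Lambda \
    --metric-name Invocations \
    --dimensions Name=FunctionName,Value=my-webhook-handler \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 100000 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-alerts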
Earlier in the week there were threads about how AWS will never implement resource blocking like you're talking about, because big companies don't want to be shut off in the middle of a traffic spike, small companies don't pay enough money, and it's not like it hurts Amazon's bottom line.
We use memory-safe languages and type-safe languages. AWS is not fundamentally billing-safe.
Just to give you nightmares: there have been DDoS attacks in the news lately, and I'm surprised nobody has yet leveraged those botnets to bankrupt orgs they don't like who use cloud autoscaling services.
I don't know how you monitor it; part of the issue is the sheer complexity. How do you know what to monitor? The billing page is probably the place to start, but it is too slow for many of these events.
I guess you could start with the common problems. Keep watchdogs on the number of Lambdas being invoked, or on any resource you spin up or that has autoscaling utilization. Egress bandwidth is definitely another I'd watch.
Dunno, just seems to me you'd need to watch every metric and report any spikes to someone who can eyeball the system.
For me? I limit my exposure to AWS as much as I reasonably can. The possibilities, combined with the known nightmare scenarios and a "recourse" that isn't always effective, don't make for good sleep at night.
> There have been DDoS attacks in the news lately, and I'm surprised nobody has yet leveraged those botnets to bankrupt orgs they don't like who use cloud autoscaling services.
That's interesting because it seems like it would happen, but what's in it for the attacker, when under threat the victim can implement caps?
A severe enough bill can cause an organization to be instantly bankrupt. No opportunity to try to do something like caps.
Regardless, turning on spending caps isn't a final solution to this particular attack. With caps the site/resources will hit the cap and go offline. Accomplishing what a DDoS generally tries to accomplish anyway.
The only real solution is that you have to have a cheap way to filter out the attacking requests.
It could only be an attack of spite; you can't really hold it for ransom because the IPs of malicious traffic could be blocked or limits set after the initial overspend. Perhaps if the botnet was big enough.
I think you're limited to 1,000 concurrent Lambda invocations by default anyway.
That said, it's not easy to get an overview of what's going on in an AWS account (except through Billing, but I don't know how up to the moment that is).
I've been able to get AWS support to waive fees for a runaway Lambda that no one spotted for a few weeks - they wanted an explanation of what happened and a mitigation strategy from us and that was it.
It is still unresolved because AWS wants us to pay the bill so they can then issue a credit but the company credit card doesn't have a high enough limit to cover the bill.
>"Racked up a several-hundred-thousand dollar bill in a couple of hours."
This is enough to rent a big server from Hetzner / OVH for like forever and have a person looking after it, with plenty of money left.
>"I've experienced how painful companies that tried to run their own infrastructure made things like DB backups"
I run businesses on rented dedicated servers. It took me a couple of days to create a universal shell script that can create a new server from scratch and / or restore state from backups / standby. I test this script every once in a while and so far have had zero problems. And frankly, excluding cases when I want to move stuff to a different server, there has not been a single time in many years when I had to use it for real recovery.
I did deployments and managed some infrastructure on Azure / AWS for some clients, and contrary to your experience I would never touch those with a wooden pole when I have a choice. Way more expensive, and they actually require way more attention than dedicated servers.
Sure, there are cases when someone needs "infinite scalability". Personally I have yet to find a client where my C++ servers, deployed on a real multicore CPU with plenty of RAM and an array of SSDs, came anywhere close to being strained. Zero problems handling a sustained rate of thousands of requests per second on a mixed read / write load.
I think your last paragraph is the sales pitch for AWS. Hiring that level of expertise doesn’t scale.
Easier and cheaper to hire 10x as many "developers" and pay the AWS bill than to headhunt performance gurus who understand hardware and retain them.
What expertise? My specialty is new product design. I am very far from being a hardware performance guru. I just understand the basics and do not swallow the propaganda wholesale.
I'm not saying it can't be done cheaper or more efficiently on simpler providers or even self-hosting, but you need the expertise and time to stand up the foundation of a secure platform yourself then. For example, AWS Secrets Manager is just there and ready to code against, as opposed to standing up a Vault service and working through all of the configuration oddities before you can even start integrating secrets management into an application. If you already have a configuration-in-a-box that you can scale up, then more power to you.
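To illustrate the "just there" part: once IAM permissions are in place, reading a secret is a one-liner (the secret id below is a made-up example):

# Fetch a secret at deploy/startup time; "my-app/db-password" is a hypothetical secret id
aws secretsmanager get-secret-value \
    --secret-id my-app/db-password \
    --query SecretString \
    --output text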
Your use-case of running a web service that is written in a very efficient language like C++ is not something you see too much these days. While it would be nice if most devs could pump out services built on performant tech stacks, our industry isn't doing things that way for a reason. Even high-prestige companies with loads of talented engineers only build select parts of their systems using low-level languages.
>"Your use-case of running a web service that is written in a very efficient language like C++ is not something you see too much these days"
In some places, including big ones, it is very much being used.
>"our industry isn't doing things that way for a reason"
I think the real reason is this: the slower your stack, the more money you will pay to Amazon, Azure, Google, or whoever else. And by way of advertising, trickling down into education, and lots of other means, they make sure that this is what everybody (well, most) uses.
>"using low-level languages."
Since when is modern C++ "low level"? It is rather "any level". I compared my C++ server code with similar ones written in JS, Python, PHP, etc., and frankly, if you skip standard libraries, the C++ code can end up actually being smaller.
> > "Racked up a several-hundred-thousand dollar bill in a couple of hours."
> This is enough to rent a big server from Hetzner / OVH for like forever and have a person looking after it, with plenty of money left.
That's not a fair comparison, as you're comparing the cost of a worst case caused by a misconfiguration under very specific circumstances with the cost it takes to operate the service without such a worst case.
If you want to avoid any possibility of generating costs like that by accident, you are of course better off with self-hosting. However, even then generating such costs is certainly possible, e.g. by accidentally leaking a database with customer data through a misconfiguration.
Without assuming such worst cases AWS Lambda can be much more cost efficient than a dedicated server, depending on the use case.
There is no silver bullet. For some use cases self hosting makes sense, for other use cases using a cloud provider is the better choice.
AWS value comes from things like
* RDS - easy db backups
* CloudWatch - easy monitoring
* IAM - easy access control
* Systems Manager - easy fleet management and distributed parameter store. Integrated with IAM so you can hide your secrets.
The list goes on.
If all you need is one server then you don’t need all of that. Things change as soon as you need 40 servers, or you have 40 people accessing 10 servers.
You can do it with open source tools. It takes time and expertise to do so. Both expertise and time are not available to most companies.
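To make the Systems Manager point concrete, hiding and fetching a secret is roughly this (the parameter name is just an example):

# Store an encrypted parameter (SecureString is encrypted with a KMS key)
aws ssm put-parameter \
    --name /myapp/prod/db-password \
    --type SecureString \
    --value 'example-password'

# Read it back, decrypted, from any role that IAM allows
aws ssm get-parameter \
    --name /myapp/prod/db-password \
    --with-decryption \
    --query Parameter.Value \
    --output text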
> If you want to talk systemic AWS mistakes you can make, we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours.
I did more or less the same thing, but with a 3rd party webhook. The bill almost killed my company.
The resource usage required to tank a small startup (that could’ve become a bigger customer later) is probably peanuts to Amazon. I’m not sure how often they do this (or whether they do it at all) but it would make business sense for them to occasionally grant “billing forgiveness” in serious situations.
Of course Amazon can sponsor those companies hoping that they'll bring more profit in the end. But that's not a guarantee, just good will, maybe depending on the mood of the support person who handles that specific case. I made a mistake with Amazon in the past which cost me $100 and I did not get a refund despite asking for it. It left a sour taste in my mouth, but whatever: my mistake, my money.
It was a 3rd party resource that when updated would call a lambda via a webhook, which would then update the 3rd party resource. So it would create an infinite loop for each resource that was modified.
At Poll Everywhere we run a few high volume SMS short codes. Somehow somebody texted one of our short codes, which replied to an Uber SMS phone number, which replied to our short code, which replied back to Uber, which replied back to our short code, which replied back to Uber, which replied back to our short code…
After a few days of this we racked up a bill that I think was $10,000’s in SMS fees before we noticed the problem and terminated the loop.
It’s pretty crazy the problems you’ll run into given enough time and scale.
Things AWS solves for me that I've always wanted to have solved:
* Database administration
* Security best practices by default
* Updated infrastructure
* Automatic load balancing
* Trivial credentials management
* 2FA for all infra administration
* Container image repositories
* Distributed file systems
I was an old-school bare-metal UNIX systems admin 15 years ago. Each of those things, in medium to large companies, would take a full-time sysadmin to keep it all up to date.
Same. I used to really hate "cloud", given I was old-school devops and could do all of the things listed above with self-written management tools and a sprinkling of Ansible. However, I think in the past year I've seen the light, shaved off my neckbeard, and really enjoyed the lightening of my shoulders from not having to think about problems that no longer exist.
Many companies want disaster recovery and multi-region deployments without the capital expenses required to deploy this themselves.
I don't want to have to buy hardware from a vendor, find cabinet space, negotiate peering and power agreements, deal with 3am alerts for failed NICs, or hear about someone spending hours freeing up disk space while waiting on new drives to arrive.
I want the benefit of all these things, but I'd rather pay a premium for it over time than deal with the upfront capital expenses.
The problem is that not everyone wants to self-host, not everyone wants to manage hardware, and not everyone's tech scales in an extremely predictable and easy way. We launched a new tenant that required a bunch of new EC2s, databases, etc. It was trivial with AWS and Terraform. If we had done our own homegrown solution, we would have had to either order that hardware and wait on it, or have it ready in reserve just burning cash doing nothing.
AWS is complexity-as-a-service. This is why, as a one-man company, I went baremetal[1]. One flat price, screaming fast performance, and massive scalability if you get a beefy enough machine[2]. I don't have time to fiddle with k8s, try to figure out AWS billing/performance tradeoffs, or deal with untraceable performance issues due to noisy neighbours and VM overhead. My disaster recovery plan is a simple DB dump script to S3, and I know I can get another baremetal server up and running in less than 20 minutes.
The comment on "complexity-as-a-service" resonates. IMHO, it's primarily because they want to make a product out of everything, including stuff companies build to manage their own AWS implementations. Instead of a simple list of products, it's a complex list, with lots of nuances for each service offering. The other day, I was giving a high-level summary of cloud technology to an intern; there was a point where I couldn't even find the AWS service I was telling her about in the product list, which annoyed me. Maybe that's more a comment about the marketing site, but still, when your product catalog gets that big, it's hard to avoid ridiculous levels of complexity.
From that Let's Encrypt article: "We have a number of replicas of the database active at any given time, and we direct some read operations to replica database servers to reduce load on the primary."
I tend to agree with this. AWS etc is nice if your scale is big enough that you need to run a big cloud of dozens of servers with complex interconnections for security etc. If a single plain old server with database etc on it will do the job fine, much better to stick with that.
A bare metal Postgres install needs optimization, and a working backup and restore plan (you did test your backups, right?).
That's half a day of work lost to get your system set up.
Now your app keeps serious data and you want a read replica. How long does that take?
Now you need a separate development environment. Here you go again, adding a few hours of work.
Then you need to update your database version. Gotta read the changelog and make sure you did everything right, and do it in a reasonable change window.
You just racked up several days' worth of work, and for a DB instance with a similar amount of infra work done, the RDS solution is way cheaper and easier to provision.
If your time is worth money, there's no reason to go bare metal.
Why does my bare metal Postgres install need optimization? My sites mostly don't get much traffic, and it runs fine as-is. It'd be silly to try and optimize it without being able to measure what's actually slow.
Backup systems should also be set up according to desired reliability. I have a 10-line bash script that pulls a DB dump, zips it, and sends it to S3. Under 5 minutes to install, including setting up a new AWS role and keypair for it, just have to add in some Ansible commands I already have set up, and set a cron job to run once a day.
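For anyone curious, the whole thing is roughly this shape (database name, bucket, and paths are placeholders; the real script has a bit more error handling):

#!/bin/bash
# Nightly Postgres dump to S3; run from cron, e.g. "0 3 * * *"
set -euo pipefail

db="myapp"                      # placeholder database name
bucket="s3://my-backup-bucket"  # placeholder bucket
stamp=$(date +%Y-%m-%d)

pg_dump "$db" | gzip > "/tmp/${db}-${stamp}.sql.gz"
aws s3 cp "/tmp/${db}-${stamp}.sql.gz" "${bucket}/${db}/${db}-${stamp}.sql.gz"
rm "/tmp/${db}-${stamp}.sql.gz"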
Read replicas are nice for some applications, but not needed for any of my current ones. I probably wouldn't want to set one up on bare-metal admittedly, but I'm not worrying about it until I need it.
I don't see a need for a separate cloud deployment for a development environment for my current application either. Would be nice if I had multiple developers and testers working on it, but I don't now.
Never needed to update the DB version, and the traffic is low enough that I don't need to really care about keeping reasonable change windows if I did.
So nope, 10 minutes of work for a low-traffic application. Meanwhile, an AWS RDS setup is easy to start, but then you have to muck with security groups, VPCs, permissions, etc. to get it working right. That's not necessarily easy if you don't already make use of that stuff.
There's gonna be a lot of judgement calls here, but I think a local Postgres install is the happy medium. A basic local install is super quick and easy to do, and it also has a lot of good scalability options. SQLite is okay as long as you stay small, but it doesn't scale up very well.
If I suddenly get big and my database is Postgres, then I can spin up my own dedicated server, or switch over to RDS after all, optimize concurrent queries, hire a DBA with scaling expertise, etc. Most of that isn't an option with SQLite. I'd have to switch to a completely different database engine. Despite the promises from ORM writers, I've never seen this go smoothly, and it would have to be done at the worst possible time for it.
> Then you need to update your database version. Gotta read the changelog and make sure you did everything right, and do it in a reasonable change window.
What, as opposed to needing to read the RDS patch notes and schedule the maintenance window?
You might need to read the Postgres version changes for your own database code, but you're not gonna need to devise and execute a procedure for upgrading your database, which is essentially done with a single click in an unsupervised manner.
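And even the "single click" has a scriptable equivalent if you prefer the CLI; roughly (instance identifier and target version are placeholders):

# Ask RDS to upgrade the engine in place
aws rds modify-db-instance \
    --db-instance-identifier mydb \
    --engine-version 14.7 \
    --allow-major-version-upgrade \
    --apply-immediately   # omit to wait for the next maintenance window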
- Ansible for the low-level stuff (like network, mounts, iSCSI, configuration files)
- Terraform for high-level stuff (like DB users)
In my case, as I have several services running that use a lot of RAM, I couldn't afford The Cloud but can easily afford a colocation. I don't mind the maintenance (it's a couple hours each month) and I don't care much if services are down a few hours.
If you need something running 24/7 with 99.9%, colocation will be more expensive just because of the human you need.
Not sure about op but you have to put forth quite a bit of effort to get declarative infra with Ansible. Some of it is declarative out of the box but a lot is imperative.
The main difference: if I revoke a DB privilege, I have to add a line to Ansible with a REVOKE in most cases, versus with Terraform you just delete the config line and the tool realizes it during its diff stage and performs the removal change (it's stateful and declarative).
Does Terraform have any drift? For example, if I remove a DB from Terraform, is it going to remove the DB from prod? Can I trust it's always the same, or do I have to do cleanups?
Terraform is declarative, meaning: The stuff you write into the terraform files (TF) is exactly the stuff you get configured.
Example: you add a new database and a database user to the TF files, run terraform plan & apply, you do have the database and user configured. You remove the user and run plan & apply, terraform does remove the user.
It goes the other way too: if you add additional port rules to a network security group (e.g. in AWS) by hand, terraform will remove them on the next plan & apply because they are not defined in the TF files.
So in conclusion: for your declared state, there will be no drift.
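In practice the no-drift guarantee comes from always going through the same two commands; a rough sketch of the loop:

# Edit the .tf files (add or remove resources), then:
terraform plan    # shows the diff between the declared state and what actually exists
terraform apply   # makes reality match the files: creates what's missing, removes what you deleted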
I do not see anything about it that is less feasible when doing it with threads / forks.
Asynchronous IO like in Go is useful, I think, in the sense that if some of the networked responders are temporarily slow, they will not hold up the fast ones and will not accumulate threads. If however it is disk / database, then by immediately switching to the next task this strategy will let internal async tasks pile up indefinitely and saturate the OS's resource pools just as well. So those "concurrency focused" languages are not a silver bullet, and one has to understand the internal mechanics to properly manage the request lifecycle.
Shifting any non trivial infrastructure into AWS verbatim is always more expensive than running it yourself. You need to rearchitect it carefully around the PaaS services to make a cost saving or even break even.
An extreme example of this is my cousin, who works for a small dev company doing LOB stuff. They moved their SQL box into AWS and it's costing more to run that single RDS instance than their entire legacy infra cost was per year.
I’d still rather use AWS though. The biggest gain is not technology but not having to argue with several vendor sales teams or file a PO and wait for finance to approve it. All I do is click a button and the thing’s there.
> The biggest gain is not technology but not having to argue with several vendor sales teams or file a PO and wait for finance to approve it. All I do is click a button and the thing’s there.
This is so ridiculous. I have to argue endlessly (and again for every employee) with IT support and enterprise security to give them the ability to upload attachments on Teams.
But giving that same person access to start a few $100/hour instances on AWS? No problem.
The balance is completely out of whack once your infra is on AWS.
But that's the thing. A senior engineer is also costing a business ~$100/h (all in), so even if you accept that there will be a fair amount of waste (misconfigurations, devs spinning up boxes and then forgetting about them, PoC projects never torn down, etc.) it can still be a net-positive proposition. People always want to compare the cost of compute/storage/bandwidth, but that isn't really the value proposition. Of course you are going to pay more for cloud-hosted infra than for equivalent infra in a colo or on-prem DC. But I used to spend hundreds of hours a year doing random busy work to deal with on-prem infrastructure. Need a new server? Put in a ticket, and when the ticket gets no response, follow up with emails and finally set up a meeting to discuss with the ops team. Need a new firewall rule? Same deal.
Have them build an s3 attachment service and just let finance know you are spending $X/yr and you have ideas for streamlining IT support that would eliminate the cost.
It's always more expensive to have someone else run your infrastructure than to do it yourself unless it's something you only use intermittently.
If you need 5 seconds of compute time per day then running that as a Lambda makes perfect sense. If you need a database server that's available 24/7 then I can't see how hosting that on Amazon could be cheaper.
(Unless you're employing a full time ops person to look after that one server, in which case you'll have to do your own maths.)
At a previous job, I ran a website on RDS for about 7 years and only touched the control panel to restore from a backup when we'd screwed up the data, and to tell it when to upgrade. That was worth quite a lot in ops time and peace of mind.
A database crashing on a hosted service is such a rare event unless you mess it up yourself running rogue queries. No surprise there. It doesn't have to be AWS though. It can be DigitalOcean or anything else.
Well that 24/7 db server is not going to back up itself and maintain itself when the HW fails or the power goes out. This is not trivial/cheap to assure and maintain. Likewise with hosting your own S3. The value is not in the HW really (and the cloud providers know this when pricing).
I have this setup on AWS and look at "bringing it home" every year, but the nightmare of having to assure a good level of availability is not worth the saving of a few extra $1000 per year, for us at least. Completely different issues not related to your HW can happen: your office internet simply goes down, or the power goes out while you're on vacation, etc.
At least there is major competition between a lot of cloud providers nowadays, so nobody can get away with insane prices anymore. Though it would be cool to see some kind of standardised price comparison metric for medium/high complexity cloud setups. Sort of how you compare grocery prices: you have a standard purchase list.
> Well that 24/7 db server is not going to back up itself and maintain itself when the HW fails or the power goes out.
Uh, yeah, that's why you pay IT people to do that sort of thing. Hiring your own will almost certainly be less expensive than paying Amazon's people, and provides you more control and more options in the event of any problems. It's not like AWS never has problems, and when it does all you can do is twiddle your thumbs until someone else fixes it.
Once you hit that point, AWS will almost always cut you some specific pricing deals to ensure that your pain point is competitively priced, since they want you in the ecosystem anyways.
Yep that. Lambda is a massive win for me personally. I have some scraping and processing stuff that runs daily. Costs me $0.60 a month to run it even outside of free tier which is less than a cheap DO or linode box and I don’t have to look after the OS.
Same. Powershell script that collects links from Pinboard and posts them to my blog. It would be massively overkill to run a whole server for that. (And Microsoft charges me about £0.15 per month)
I can confirm that: cloud helps to evade the incompetent sales and infrastructure teams in many companies. Saving money never works once your product scales out.
> Shifting any non trivial infrastructure into AWS verbatim is always more expensive than running it yourself.
The free tier of AWS Lambda has enough room to do non-trivial applications for free, and in EC2 we can get t2.micro and t3.micro instances (1-2 vCPUs, 1GB RAM) with 750h/month for free, which pretty much means you can have an instance running the whole month for free.
Depending on what you need to do, in the very least it's possible to run a system (or parts of it ) for free, which is hard to beat.
Having said this, allowing a system architect to go nuts with AWS without being mindful of its cost is something that easily gets far too expensive far too fast. If all anyone wants is EC2 and there's no need for global deployments then you'd be better off going with cloud providers such as Hetzner. A couple of minutes with a calculator and a napkin at hand is enough to arrive at the conclusion that AWS makes absolutely no sense, cost-wise.
Wouldn't it make sense to book typical SaaS on the AWS Marketplace? I mean, you wouldn't have to talk to the billing department, just activate a SaaS within AWS and everything is put on your normal AWS bill?
I've made it a habit to absolutely avoid any and all AWS services for any side projects, unless it's on the employer's dime. I'd rather pay a bit more per month for a flat-fee Digital Ocean droplet. Maybe I'll end up paying a few dollars more than I would with the equivalent AWS setup, but I'll rest easy knowing I won't get a surprise bill thanks to the opaque and byzantine billing. I mean, there are consultancies whose entire premise is expertise on AWS billing, so the chance of AWS newbie-me running up many thousands because I forgot to switch off service A or had the wrong setting for service B is non-zero.
And the general advice is "don't worry, call their customer support and they'll refund you". Um, seriously? If I want to spend a morning on hold to deal with a huge unplanned bill I'll call my local tax office, thank you.
Which sucks as I learn best by building things in my spare time, but AWS makes that learning process a bit more stressful than I'd prefer.
Pretty much summarises my decision to use Linode. At a small company, AWS presents a bigger monetary risk and drain on precious developer time and mental overhead than the relatively small savings it might return at smaller scales...
I also actually like Linode as a company and enjoy using their services and management interface; Amazon is challenging to be positive about.
Also use Linode, they're great. Their docs are a treasure
Seems absolutely insane to use AWS for small personal/learning projects (unless the goal is to learn AWS for career purposes, I guess). It'd be like using Unreal Engine to make your 2d indie game
Always use the smallest and simplest solution that'll do the job. Simple solutions are not just as good for simple jobs...they're better
[Disclosure] I'm Co-Founder and CEO of http://vantage.sh/ and was previously the lead PM on DigitalOcean's Droplet product as well as on the product team at AWS for container services.
We try to help out a bit on this with Vantage which essentially gives you a DigitalOcean-esque view of your AWS costs. The first $2,500 in AWS costs are tracked for free which would seemingly cover your side-projects.
It sounds like you've found your home on DigitalOcean, but I'd be curious if something like Vantage would potentially change your decision to build on AWS? In particular what you mention about runaway bills is something that Vantage sends alerts on in advance. We also show you a full inventory of your AWS resources and what they cost you.
Vantage integrates at an AWS account level through a mechanism called a Cross account IAM role which allows us to ingest and process the raw data that AWS uses for its own billing systems (Cost and Usage Reports, Service APIs and Cost Explorer)
We haven't seen a single case where we don't end up matching the actual AWS bill. In fact, with a release currently in BETA and rolling out in a week or two, we'll be providing richer data faster than AWS Cost Explorer provides.
Having seen firsthand the kind of devious folks who are out there constantly trying to do fraudulent activity on AWS, I don't think any small startup like Vantage would ever want to offer a bill insurance scheme. It would be ripe for exploitation, such as someone trying to spin up 1000 instances in the last couple minutes of the month and then say "Gotcha, your bill prediction didn't match up with the real bill after all!"
At a more general level this may also be one of the most entitled asks I've ever seen in the 12+ years I've been on HN.
StratusBen, this is where you could charge an incredible premium for your service.
If you could underwrite price guarantees and show an insurance company or lender that your figures are right in most cases and that you've never lost over a certain amount, you could really hike the cost of your service and provide an incredible utility across the board.
If you have spare bandwidth and can reliably do this, try building this.
This is important, and will show confidence in your product.
The requirement is insurance that a hard limit will not be exceeded. We're talking about a rare event, so the offering (vantage.sh) is not good for this specific use case if it isn't absolutely foolproof.
While building CloudAlarm [0] (supports Azure as of now), we found that the usage data on Azure wasn't available for at least a day – in fact, they keep adding data to it during the next day, so technically two days would have passed by the time the actual usage data is available. I haven't gone deeper into AWS, but from what I've read they also make the data available gradually. So we thought instant alerts based on usage cannot be possible, and hence we chose a novel route – a 'New Resource' alarm – wherein you can get alerted for all resources created, or for anything more expensive than the tier you choose. The resources in your Azure subscription are available almost instantly via the API, so we thought this was a nice workaround.
I don't get why you're so stressed out by AWS billing.
From my experience, once you've worked a bit seriously with AWS, billing is not a blackbox anymore and you're able to plan ahead without too much surprise.
If you're still worried, there's also the option of setting alerts on budget spent and on the budget forecast, which should settle the debate. (These are also part of the API, so you can deploy and configure these alerts through Terraform.)
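For completeness, those budget alerts can also be created from the CLI; a minimal sketch (account id, amount, and email address are placeholders):

# Email when forecasted monthly spend is on track to exceed a $50 budget
cat > budget.json <<'EOF'
{"BudgetName": "monthly-cap", "BudgetType": "COST", "TimeUnit": "MONTHLY",
 "BudgetLimit": {"Amount": "50", "Unit": "USD"}}
EOF

cat > notifications.json <<'EOF'
[{"Notification": {"NotificationType": "FORECASTED", "ComparisonOperator": "GREATER_THAN",
                   "Threshold": 100, "ThresholdType": "PERCENTAGE"},
  "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "me@example.com"}]}]
EOF

aws budgets create-budget \
    --account-id 123456789012 \
    --budget file://budget.json \
    --notifications-with-subscribers file://notifications.json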
I know experienced people who have woken up to several thousand dollar AWS bill they didn’t expect. And the large cloud providers have clearly indicated by their actions that they’re simply not interested in implementing hard cost circuit breakers.
I use AWS very lightly but I totally understand why someone wouldn’t.
> the large cloud providers have clearly indicated by their actions that they’re simply not interested in implementing hard cost circuit breakers.
I agree, my term for this is “bad faith”.
I recently had a free $200 credit for Azure. I setup their default MariaDB instance for a side project, figuring I’d get my feet wet with Azure. I didn’t spend time evaluating the cost bc I figured, how much could the default be if I haven’t cranked up the instance resources at all? Turns out the answer is more than $10/day which I discovered when authentication failed to my test DB. Back to Digital Ocean.
My term for it is “you’re not their use case.” For better or worse, they’ve prioritized usages that would much rather have an unexpected few thousand dollar bill than have services paused or shutdown unexpectedly.
But computers can behave differently based on user choice. Right? So there could be a user option to cut service beyond a fixed spend. It wouldn't be hard to implement, and tons of people would use it. They don't do it.
It's not a tragic case of priority and limited engineering resources. They like surprise bills, just like hospitals do.
Businesspeople love it when you come to their service and click through their Russian novel of a service agreement that would take a team of lawyers to parse. Once you do that, your money belongs to them! It's their court, their rules! They love it!
> It wouldn't be hard to implement, and tons of people would use it. They don't do it.
Please describe to me, in detail, how this works.
Because every time this comes up everyone claims it's the easiest thing in the world, but if you try and drill into it what they end up actually wanting is generally "pay what you want" cloud services.
There are a _ton_ of resources on AWS that accrue on-going costs with no way to turn them off. A "hard circuit breaker" that brings your newly accruing charges to zero needs to not just shut down your EC2 instances, but delete your EBS volumes, empty your S3 buckets, delete your encryption keys, delete your DNS zones, stop all your DB instances and delete all snapshots and backups, etc, etc.
The only people I see using a feature like this are some individuals doing some basic proof-of-concept work and... a bunch of people that are going to turn it on not understanding the implications and then when they get a burst of traffic that wipes out their AWS account they're going to publish angry blog posts about how AWS killed their startup.
If, like most people, you don't want literally everything to disappear the first time your site gets an unexpected traffic spike, you can already do this by setting up a response tailored to your workload--run a lambda in response to billing alerts that shuts down VMs, or stops your RDS instance but leaves the storage, etc.
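As a rough illustration of that pattern: the responder doesn't have to be fancy. The core of a "stop the dev boxes when the billing alarm fires" reaction is just something like this (the tag filter is a placeholder, and you'd run it from the Lambda or a small cron box wired to the alarm):

# Stop every running instance tagged env=dev
ids=$(aws ec2 describe-instances \
    --filters "Name=tag:env,Values=dev" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)

if [ -n "$ids" ]; then
    aws ec2 stop-instances --instance-ids $ids
fi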
> Because every time this comes up everyone claims it's the easiest thing in the world, but if you try and drill into it what they end up actually wanting is generally "pay what you want" cloud services.
Why is it on any (usually a relatively new) user to define how an entire cloud should behave?
Users are asking for a feature that helps them stop accidentally spending more than they intended. This feature request is totally fair. Implementing such a feature would be an act of good faith towards new/onboarding users (also obviously just any user with a very specific budget use-case).
> The only people I see using a feature like this are some individuals doing some basic proof-of-concept work and...
Yes exactly. GCP offers sandboxed accounts for this exact purpose. Why is this such a far reach?
> setting up a response tailored to your workload--run a lambda in response to billing alerts that shuts down VMs, or stops your RDS instance but leaves the storage, etc.
If you're telling every individual user that falls into a specific category to build a specific set of infrastructure, why is it not acceptable to you to just ask AWS to build it?
I think the sandbox idea is a great one. They should just do away with the free tier entirely except for sandbox accounts in which everything just gets shut down the second you go over the free allowance. If you want to build something for real then you pay for whatever resources you use, but if you just want to tinker around and learn a few things then you can get a safe sandbox to do it in.
BUT, I think the parent's point is that such a feature would actually be quite complicated. It's not just a matter of saying "I only want to spend $X in this account per month/total" but defining exactly what you want to do in the case where you hit that limit. Shut everything down? My guess is almost nobody would want to do that. So it ends up being some complicated configuration where you have to deeply understand all of the services and their billing models in order to configure it in the first place. What are the odds that the student who accidentally spins up 100 EC2s for a school project is going to configure this tool correctly?
But I do think the sandbox would be great. Either you are a professional in which case it is your responsibility to manage your system and put in appropriate controls to prevent huge unexpected bills or you are a student (in the general sense of someone learning AWS, not necessarily just someone in school) in which case they provide a safe environment for you to experiment.
> BUT, I think the parent's point is that such a feature would actually be quite complicated.
Sure, but so is making a cloud. Putting the onus of defining a feature like this on users, only after hearing their request ("I want to control my spend"), is IMO unfair.
Not complicated as in "too hard for AWS to build" but complicated as in "really hard to use as someone trying to limit your spend on AWS." So the people most at risk of huge unexpected bills are also not going to be the people knowledgable enough to setup the billing cap correctly. So it would mostly be a feature for enterprises and most enterprises would rather just pay the extra $ rather than potentially turn off a critical system or accidentally delete some user data.
I worked at a company that spent ~$10m per month on AWS. We had a whole "cloud governance" team who built tools to identify both over and underutilized resources. But they STILL never cut anything off automatically. The risk/reward ratio just wasn't there. You make the right call and shave $10k off a $10m bill every month, but the one time you take down a mission critical service, you give all of that back and then some.
I've been there. I shut down a bunch of what looked like idle instances doing nothing to reduce spend. 80% of them were, in fact, doing nothing. But I did take down two VMs that were supporting critical infrastructure.
Everyone who had done any work on them was long gone. I had done my due diligence to identify what they could possibly be doing.
Still, the day of reckoning came, and we got calls of services down a week after I turned them off. I spun them back up, and they were going again without any real impact to the business.
This turned out to be a blessing as the very next week the cert these same services depended on expired and if I hadn't learned about the system by turning them off we never would have known which boxes held up those services.
Also a lesson in what happens when people leave without any documentation on where the work they did lives and how it works.
> So the people most at risk of huge unexpected bills are also not going to be the people knowledgable enough to setup the billing cap correctly
Yes, which is why AWS builds it.
> . So it would mostly be a feature for enterprises and most enterprises would rather just pay the extra $ rather than potentially turn off a critical system or accidentally delete some user data.
There are plenty of service limits which you need to request to have lifted. This is a common enough use case that there's an entire console section for requesting increases.
SES has a sandbox mode which you need to explicitly disable.
Metering works perfectly fine for the free tier across a gamut of the most popular products.
Beyond this, the platform captures the necessary information in near enough to realtime to provide an accurate picture of spend.
Yet with all of these capabilities, there is no coherent way to constrain identities or accounts to a certain $ spend on even predictable services.
This has been the case for so long that it must clearly be by design. AWS has the guard rails available but only really wants to use them to stop things that would cause them pain (eg SES, bucket limits).
They totally should and could have the ability to manually _reduce_ service limits.
They totally should and could direct users to a way to restrict regions and services users can use without needing to set things up through IAM.
They could and should have service specific guardrails where they make sense - eg for ec2 provide a sandbox where they limit eg instance types, base hourly spend, use of cpu credits. If they wanted to get crazy they could provide per-service actions to take when thresholds are exceeded (though this might be a bit niche).
They could have a _simple_ billing alarm interface (and enable it by default during provisioning), and they could eat the cost of sending SMSes or emails when those thresholds are met.
Yet they choose to do none of those things, even when they provide defaults for convenience (eg encryption keys)
While my org has a somewhat nuanced understanding of AWS and can set things up to provide this certainty both on our / our clients' environments and for our developers, I have a team of people whose literal job it is to do stuff with AWS; we have had training, certs, and other access that beginners do not and cannot have.
IMO it doesn't need to be a tap that people can turn off at $500 a month or whatever; it just needs to be a bit more of a "if you select these default options on these basic services, there's no way that you'll accrue 10k in charges overnight".
Yes, in some cases the default is quite expensive – the same was true with SQL Azure (though they have changed that recently) and it had created a good amount of bill for us (though to their credit, Azure did refund on all such occasions because we didn't use the capacity at all).
However, I don't know why the alert system doesn't have an option to say "here's my budget, alert me as soon as my daily pace is set to exceed the monthly budget". Instead, you have alerts based on the % of the budget amount consumed, like you can get an email when, say, 50% of my budget is consumed, which happens every month and so kind of defeats the purpose of an alert.
We ended up creating a simple solution (cloudalarm.in – in beta) that provides such budgeted-pace-based alerts and more ways to get instant alerts, which isn't possible with usage-based alerts.
It's not bad faith. It's 'providing the resources you signed up for'.
Does it mean you have to go into your planning with more consideration as to cost? Yeah.
But how would you feel if your start-up finally goes viral, you're having your best day ever, and then your app just stops working because someone forgot to remove a hard spend limit?
Most people would rather see their app continue running.
And what does turning off the lights look like? If your database hits your cost limit, do you stop serving requests? Delete the data? To what extent do you want 'cost protection' for resources you signed up for?
> If your database hits your cost limit, do you stop serving requests? Delete the data? To what extent do you want 'cost protection' for resources you signed up for?
Sounds like a reasonable configurable option rather than “you shouldn’t be able to choose at all”.
I am sympathetic to the concern about cost overages--I've hit them in AWS before--but given the way that developers and managers think about SaaS products (generally, not just cloud stuff), I tend to think that even if you required them to click three checkboxes and sign their name in blood, the first time you vaporized somebody's production database because they hit their overages and didn't think it would ever happen would be apocalyptic. And the second, and the third. And you're at fault, not the customer, in the public square.
By comparison, chasing off "cost conscious" (read: relentlessly cheap--and I note that in my personal life I'm one of these, no shade being thrown here) users is probably better for them overall.
Work in AWS Premium Support. This is 100% how it goes.
Take KMS keys for example. You can't outright delete a KMS master key; you have to schedule it for deletion. The shortest period you can schedule for deletion is 7 days (default 30). Once the key is deleted, all encrypted data is orphaned.
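For reference, the calls look roughly like this (the key id is a placeholder):

# Shortest allowed waiting period is 7 days; the key is unusable while pending deletion
aws kms schedule-key-deletion \
    --key-id 1234abcd-12ab-34cd-56ef-1234567890ab \
    --pending-window-in-days 7

# Changed your mind during the window? Deletion can still be cancelled
aws kms cancel-key-deletion \
    --key-id 1234abcd-12ab-34cd-56ef-1234567890ab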
I used to run an AWS consultancy, which is how I know. ;) More than once I had a customer go "well support won't help me, how can I get my data back?". And I had to tell them "well, support isn't just not helping you for kicks, you know?".
I am sorry, I might be missing something, but I call bullshit. How much does it cost for Amazon to store the several bytes that make up a key? 5 cents per decade?
“Yeah, so uhm, you hit zero, so we deleted all your keys in an irrecoverable way, sorry not sorry” — that is not a circuit breaker. Make all services inaccessible to the public and store the data safely until the customer tops up their balance. That's how VPSes have worked forever.
I don’t argue that “cheapo” clients are worth retaining for AWS, clearly they are not. But this kind of hypocrisy really triggers me.
Edit: a helpful person below suggested I misunderstood the parent, and now I think I did.
AWS doesn't retain anything for you unless you tell them to, and when you tell them to delete something (as in the example relayed by the person you are replying to), they delete it as best as they are able. That's part of the value proposition: when you delete the thing, it goes away. Why would they start now for clients who want their bills to be in the tens of dollars (when if you really care you can do it yourself off of billing alerts[0])?
Going to be real: you aren't "triggered", which is actually a real thing out there that you demean with this usage of the term. You're just not the target market and you're salty that it's more complex than you think it is.
Stop serving requests until the finances are rectified, delete the data 30 days after it stops. Final migration out/egress requires a small balance for that purpose.
The engineers designing and building these systems are some of the best in the world, this is relatively trivial.
> And the large cloud providers have clearly indicated by their actions that they’re simply not interested in implementing hard cost circuit breakers.
Since I enjoy tilting at windmills--how do you propose this works? Like, in detail.
Because every time I try and drill into details of this with someone, it winds up what they really want is "pay what you want" cloud services.
AWS is much, much more than just a place to run a virtual server and many resources accrue on-going costs with no way to "turn them off". When you hit your hard circuit breaker, do they delete all your EBS volumes and data in S3? Your private SSL root? Your user directory? Encryption keys? DNS zones?
The number of people that would want all of that removed when they hit their $X/mo limit is likely minuscule in comparison to the number of people that would turn this on not understanding what it really meant and then publishing angry blog posts about how Amazon killed their startup right when they got popular and traffic spiked.
e.g. "virtual machines are stopped and de-allocated. The data in your storage accounts are available as read-only."
Most control plane operations will also be blocked. It gets complicated with more complex resource types, but it gets the job done anyway.
Note that this functionality is sort of required to support pre-paid plans without allowing them to exceed specific limits, which do exist on Azure. So there's a business dependency on this functionality today, it's not a hypothetical.
Give me a page in billing listing everything that can use money, let me enable/disable them manually (this is mostly to allow restricted dev accounts) and select which, if any, should be allowed over the bill total I set. Opt-in, but yes, do let me kill every EC2 instance if I choose.
Yup. I used AWS at my last job. We had teams of people using AWS, we had fancy 3rd party tools and extraction metrics to track and report on costs. There were still PLENTY of times when I would just scratch my head "well it looks like EC2s cost an extra $1000 this month. I wonder what happened"
This is also one of the things I fear most when running a service in the cloud: A huge bill due to excessive network usage triggered for example by a search engine, web scraper etc. I consider it very unfortunate that "capped cost" has gone somewhat out of fashion, and nowadays many major cloud providers bill excess usage rather than cutting off or slowing down traffic etc.
Here is a simple Bash script that monitors outgoing eth0 traffic (once per second) and automatically shuts down the instance once it is greater than 1 TB:
#!/bin/bash
# Shut down the instance if outgoing traffic exceeds the limit.
# Other values for testing: 10**6 (1 MB), 10**7 (10 MB), 10**9 (1 GB).
limit=$((10**12))   # 1 TB

while true
do
    date
    tx=$(< /sys/class/net/eth0/statistics/tx_bytes)
    echo "$tx (limit: $limit)"
    if (( tx > limit ))
    then
        echo cutting
        systemctl poweroff
    fi
    sleep 1
done
If you save it as /home/admin/cutnetwork.sh, you can run it as a systemd service:
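The unit file is the usual boilerplate; something along these lines (paths as above):

sudo tee /etc/systemd/system/cutnetwork.service > /dev/null <<'EOF'
[Unit]
Description=Power off instance when outgoing traffic exceeds the limit
After=network.target

[Service]
ExecStart=/home/admin/cutnetwork.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now cutnetwork.service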
This simplistic approach may require adjustments depending on network settings and operating environment, and will not work, for example, if the instance is rebooted during the billing period, since that resets the counter. I would much prefer a hard-coded setting that reliably works on the instance itself, or a hard billing limit that reliably turns off the service if the accumulated cost exceeds the set amount.
I say for enterprise projects because, although its cost is reasonable for corporate projects, it's probably not something you can justify for most personal/private deployments.
Kind of a chicken-and-egg situation, no? Unless you're on the company dime, learning to work with AWS entails that risk. A beginner simply won't know how to configure all of these things.
I did not downvote this, but I have a comment: The risk is not "minimalized by reading the docs and proper planning". To minimize something means to reduce it to the smallest possible amount, and I do not like to take any chances whatsoever when a huge excess bill is a possible outcome of a single misconfigured setting that can only be ruled out by reading hundreds if not thousands of pages of documentation and then following the documentation without mistake to the letter.
There is a clear possible solution for reliably preventing any amount of unintended overpayment, and that would be to configure a hard billing limit that can never be exceeded, no matter what else is being configured. All services that generate additional costs would simply have to stop or be removed if the configured limit is exceeded.
That would truly minimize the risk, because any configuration error I make will then not lead to excess payment if I configure such a limit and the cloud provider respects it.
That's the point though. A beginner is going to make mistakes and should be able to learn in a safe environment. Think of a student on a tight budget at college or in a bootcamp, who has to learn AWS because it's on the curriculum.
Yep. I'm a novice at AWS. Last year I heard about RDS, and tried playing around with it.
I thought I had shut it down because I clicked some button that looked like it was the turn-off button. I let it go on for a few months, only to discover that I had been charged $1,500.
The one thing that really pissed me off was how easy it was to set up vs how hard it was to take it back down. I can't remember the details, but basically you could not simply turn off the RDS instance through the UI alone (even though you can turn it on in the UI). You have to install the SDK and perform some (seemingly complex) commands.
I tried explaining that I was a beginner, and that I made this mistake by accident, and that they could easily see that I had not actually used this instance at all or even put any data into the DB. But they wanted this huge list of things from me in order to refund it, like a super in-depth explanation of how it happened. Really shitty experience overall. So I canceled my AWS account and likely won't go back until I have a job that pays me to learn and use it.
You can ‘Stop’ a DB and ‘Terminate’ a DB through the UI. If you have deletion protection turned on, you can only stop the DB until it’s turned off, which can also be done through the UI.
You most likely stopped the DB, but the problem there is that AWS will automatically turn the DB back on after 7 days. You also still get charged for storage for the time your DB is off.
Sorry to hear that though. I know it’s a really sucky situation to be in.
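If it helps anyone cleaning up, the CLI equivalents are (the instance identifier is a placeholder):

# Stop: storage is still billed, and AWS starts it back up after 7 days
aws rds stop-db-instance --db-instance-identifier mydb

# Terminate for good; skipping the final snapshot means the data is gone
aws rds delete-db-instance \
    --db-instance-identifier mydb \
    --skip-final-snapshot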
Ah yeah, I remember now. You could turn off the DB, but there was some other kind of scaffolding thing that I was still getting charged for. It seemed like an RDS-specific thing. They had to point me to a tutorial for turning that piece off.
> I let it go on for a few months, only to discover that I had been charged $1,500.
You get billed monthly, so really, letting it go for a few months is on you.
> The one thing that really pissed me off was how easy it was to set up vs how hard it was to take it back down
> You have to install the SDK and perform some (seemingly complex) commands.
Huh, no it's not. Really, it's literally one click to shut down an RDS instance, and always has been.
> But they wanted this huge list of things from me in order to refund it, like a super in-depth explanation of how it happened.
I mean, that makes sense to me. They did reserve and partially use these resources for you, so it's only fair that you have to go through the trouble of explaining why they should let it go. If there were no downside, everyone would just reserve a bunch of resources all the time.
From your comment, it seems you didn't even bother to answer their questions to get a refund. I would totally not hesitate to charge you if I were in AWS's place.
lol exactly. I've discussed this incident before on HN and had the same kind of "well tough luck but you deserved it" responses. They seem not to understand how off-putting it is to newcomers or students, to basically say "hey well YOU made a mistake, the trillion-dollar corporation SHOULD take your money, idiot!"
>I've discussed this incident before on HN and had the same kind of "well tough luck but you deserved it" responses.
Brush it off. I've worked with AWS pros who get lost in the billing. In my last job, we had a big "hackathon" where the objective was to reduce our AWS spend. Overall, we reduced our annual bill by a couple million dollars.
Pretty stupid to just assume that I "didn't even bother" to answer their questions. I'm not going to write a novel explaining every minutia of my interactions with AWS.
> letting it go is on you
I never said it wasn't. If you pay attention, you'll see that the context of this comment is discussing the "it's on you" culture surrounding AWS and how hostile it is to newcomers.
Exactly, it's pretty common for them to shoot themselves in the foot with a couple-hundred-dollar bill, just for one tiny instance plus the additional options "you have to" enable.
IMO AWS deliberately lets these things happen and reimburses them later with excuses. At this point it seems more like a strategy.
I don't think it's so much a money-grabbing strategy as the problem that AWS is less a suite of unified services and more a litter of puppies fighting in a sack. With that kind of org chart it's difficult to have a unified, simple billing experience with good beginner training-wheels and on-ramps.
>I don't get why you're so stressed out by AWS billing.
Presumably because if you haven't used it and are digging deeper there's so many services with different types of billing that it can be hard to keep track.
And who among us has not accidentally left something running long after it should've been shut down...
I never have left something running like that, but someone else in the org had made a service that would spin up a t1.micro for end users to experiment with one of our SaaS offerings. IIRC the instance would run for maybe 24 hours, then shut down and delete itself. It didn't delete the EBS volume, though.
So about 6-9 months after I left, I get an email: please delete these, I'm too busy to use the UI to delete these 8,000 EBS volumes. I don't know how much that cost, even for the final billing month, but it took me an hour and a half to write and test the script, an hour and a half to run it, and an hour of my time to gather requirements and prepare an invoice.
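For the curious, the script was nothing fancy; roughly this shape (the region is a placeholder, and it only touches volumes in the 'available' state, i.e. not attached to anything):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Page through every volume that is not attached to an instance.
    paginator = ec2.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}])

    for page in pages:
        for vol in page["Volumes"]:
            print("deleting", vol["VolumeId"], vol["Size"], "GiB")
            ec2.delete_volume(VolumeId=vol["VolumeId"])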
I've managed AWS for corps that have spends in the millions of dollars per year (and knowing how split-brain org charts are, possibly millions per month) on AWS and other platforms. Actually, I know it was in the millions-per-month range; two transcoder services alone were costing $250k/month. Obnoxious.
Until you're into EC2-Other, and then you have to follow various guides to figure out most of what that means. Even then, it's black-box billing that even teams struggle to explain. I spend a ridiculous amount on egress that's nearly impossible to track.
If I'm using my own money on a personal project, I do not want "alerts". I want a maximum budget spend per X, where X is a small increment of time, like an hour or day.
Supporting hobbyist projects would absolutely lead to higher AWS adoption, at least in smaller companies.
> I struggle to understand what's so interesting in X(AWS/Azure/GCP/Alibaba/SAP/Oracle...) Cloud that it appears basically daily on HN
> I see it like +-decade old (in mainstream) wrapper/apis over managing VMs/Infra while being proprietary as hard as possible
Don't know if serious or not, but nevertheless let me try. It's the global and near-infinite scale, and the enormous number of managed services you get behind those APIs. You need a database/message queue/object storage or whatever, at whatever scale? Have at it, and pay as you go. If you can't see the interest in that, I wonder what it is that you do.
And IMHO there's nothing inherently complex about the APIs of AWS or GCP ( the only ones I've really used). They're as complex as the things they manage.
I don't think we're anywhere near peak "cloud" yet. We're only now starting to see entire dev environments in the cloud, and I'm pretty sure we will continue to go in that direction.
The original, main value proposition was that you don't need to physically manage your server anymore. Some products are just wrapped, managed versions of something you're already familiar with. Then there are more "original" offerings like DynamoDB, Bigquery, and Bigtable that bring a lot of value on their own and are significantly easier to operate at large scale than any open source equivalent.
We are talking about things we're using or want to use. And cloud usage is not going to go down anytime soon.
For example, AWS spend is likely the biggest spend in my company after payroll/offices, so it is a pretty important topic, business-wise.
And unlike JS frameworks, which only matter to a subset of JS frontend developers, everyone can use the cloud: JS or Java or Rust or C++ or C; frontend, backend, data science, ML, compilers, embedded.
I feel you. I do take the risk – the leverage on automation and manageability that Terraform et al. give me is just too good to pass up, and only with a 100% API approach like the one AWS provides can I play the 100% infrastructure-as-code game, and I simply won't play any other game anymore.
Through the very same means, the first thing I do with every new AWS-based project is set up a cleanly organized Org with centralized billing, centralized IAM & roles, centralized billing alarms, centralized SCP limitations(!!!) (as in “I will never run anything in Southeast Asia, so I disallow anything in Southeast Asia for all Org accounts”), and very not-centralized resources per stage/subproject/vertical/whatever.
Plus sensible service limits on everything that has a service limit (request rates on API Gateway, etc.).
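A sketch of what that region-deny SCP can look like when created via the API (the allowed regions, the exempted global services, and the root id are placeholders, not a complete policy):

    import json
    import boto3

    org = boto3.client("organizations")

    # Deny everything outside the regions we actually use. Global services
    # (IAM, Route 53, CloudFront, ...) need to be exempted or they break too.
    scp = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnusedRegions",
            "Effect": "Deny",
            "NotAction": ["iam:*", "organizations:*", "route53:*",
                          "cloudfront:*", "support:*"],
            "Resource": "*",
            "Condition": {"StringNotEquals": {
                "aws:RequestedRegion": ["eu-central-1", "eu-west-1"]}},
        }],
    }

    policy = org.create_policy(
        Name="deny-unused-regions",
        Description="Block all activity outside approved regions",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(scp),
    )
    org.attach_policy(
        PolicyId=policy["Policy"]["PolicySummary"]["Id"],
        TargetId="r-examplerootid",  # placeholder Org root id
    )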
But as someone here said: your risk will remain > 0, you just have to accept that.
DO doesn’t offer accounting of access in its services (who accessed what and when), has no IAM, which is a huge problem, and their APIs and services are a bit low on reliability. Spaces has extremely low rate limits and calls to their API time out often. The k8s service works overall but has some annoying hiccups that only support tickets will fix, and the CDN returns random 503 errors. All droplets are shared, and resource contention is a thing.
That said, AWS support has its own issues, rarely solving the problem even when we pay for a TAM. Services like ElastiCache are hard to upgrade with zero downtime. Their solutions always involve spending inordinate amounts of money on open-source clones with 1/10 of the features, and their good services (DynamoDB) will cost an arm and a leg.
From what I can see, it's not a matter of bang for your buck; what matters is that AWS scales down lower than DO, so if you are not fully using your VPS, AWS is cheaper.
Of course, I side with the GP here, it's just not worth the risk. I could save a bit by switching my VPS too, but I won't.
With employers I almost always use AWS; for side projects I almost always use Hetzner for cheap servers. I don't even think you need to worry much about learning AWS unless you need it, but if you do, you can limit your budget, set alerts, and hope they go off.
I'd be happy to never touch AWS unless a) I'm not the one paying for it and b) there is a genuine need. Unfortunately it's increasingly a job requirement.
IMO, AWS isn’t competing on being cheaper in dollars per byte, but rather on being faster and cheaper for your engineering team. If your engineering team is free (as you might decide your time is on a hobby/side project), it’s harder to make the case (I still run my side projects there though), but when half your ops team can effectively wear AWS badges, that offsets a lot of line-item markups.
I really wonder how true that is. Sure, for things like S3 or RDS it’s indeed easier, but for most other things I find AWS either very limiting or extremely arcane.
Even “simple” things like Lambda underdeliver; just this week we ran into problems using it with a VPC, for example. Elastic Beanstalk was another one: fine-ish for simplistic things but problematic with the smallest customization, plus lots of undocumented and undebuggable quirks, like breaking if you use UTF-8 characters in your commit messages.
Of course, we now have the problem where some people, both seniors and juniors, only know or only ever worked with AWS, which makes the assertion that it is “faster and cheaper” correct, but is worrisome, as lots of people are not being taught what used to be the basics 10 or 20 years ago.
I recently switched jobs. At my old company the dev team basically had carte blanche to setup their own AWS infra without any real restrictions (there were some but very few). It was nice, I almost never had to ask anybody outside my team to do anything to unblock us (at least from an infra standpoint).
In my new job we also use AWS for everything, BUT I haven't used the AWS CLI once. I don't even have credentials. Basically, there is a platform team responsible for running a k8s cluster (or clusters, really) and some other common infrastructure, which is all running on bare EC2 (no RDS, no EKS; we do use S3 for blob storage but that's about it). And I have to say it is pretty amazing. No more dealing with arcane IAM rules, or trying to figure out how to string together a chain of Lambdas to do some sort of complex orchestration task.
I've really come to appreciate the model of using k8s as your "cloud platform." Having a dedicated team that manages the k8s clusters and makes sure they are elastically scalable and reliable. Everyone else is just deploying stuff to kubernetes. They could decide to move everything to a colo tomorrow and I would have to change exactly nothing about how I do my job.
Ah, that’s the issue, at least for me - IAM roles are pretty nuanced, and it’s difficult to understand all that’s happening.
I’m working extensively with EventBridge now, and their “security” docs mix “what can access EventBridge” with “what EventBridge can access”. Also, different AWS services all seem to have different requirements - e.g., some are role-based, some are service-based, and some are resource-based. It gets complicated very quickly.
Lambda had some gotchas around the warm-up time the last time I used it to implement something. We had to have some extra workarounds to prevent the functions from going "cold".
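A common version of that workaround is a scheduled "ping": an EventBridge (CloudWatch Events) rule invokes the function every few minutes with a marker payload, and the handler short-circuits on it. A minimal sketch (the {"warmup": true} payload shape is just a convention, not anything AWS defines; provisioned concurrency is the managed alternative these days):

    # handler.py - keep-warm short-circuit sketch
    def handler(event, context):
        # A scheduled rule sends {"warmup": true} every ~5 minutes so the
        # execution environment stays warm between real invocations.
        if isinstance(event, dict) and event.get("warmup"):
            return {"warmed": True}

        # ... real work goes here ...
        return {"status": "ok"}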
Don't get me wrong, I'm a Lambda fan. But unlike EC2/EBS/etc, talking to some AWS services from a Lambda that's inside a VPC requires additional infra and you have to pay for the egress. In the end it just wasn't worth the price for us. It was a bad surprise money-wise.
> If your engineering team is free (..), it’s harder to make the case (...), but when half your ops team can effectively wear AWS badges, that offsets a lot of line-item markups.
I have to call bullshit on this claim.
Let's look at the facts. With AWS there are only two scenarios: either you go with the classic "VMs provided by a cloud provider" which is represented by EC2, or you go with hosted services and higher level abstractions like AWS's serverless offerings.
Regarding EC2, AWS offers absolutely no operational advantage over any other cloud provider while being far more expensive. Also, CloudFormation/CDK is arguably far worse and outright developer-hostile compared with any configuration-as-code alternative. This comparison makes even less sense if we look into AWS's containerization offerings, which are either half-baked (ECS) or an afterthought that lags behind alternatives (EKS).
Then we have the higher-level abstractions of AWS's managed services and serverless options. Price-gouging runs rampant in this domain, and it arguably demands much more training and many more man-hours to become effective at running production services compared with just running your own. This scenario entails higher costs, and the only arguments any ops team can muster revolve around sunk cost and vendor lock-in.
The price-gouging services make sense if they save you from having to hire additional employees. That's the benefit of any managed service provider: they can run it more efficiently than you can (once you add all the people, supervisors of those people, people to cover when those people are on vacation, etc.).
It's a way to shift people costs to IT operational expenses. If you don't do that, it's more expensive. If you do, it can easily be less expensive. I'm pretty sure we're at the point where it's less expensive because developers are waiting minutes rather than weeks [or more] for TechOps actions to happen (we were on-prem previously [and I ran TechOps]). That saves time and changes the way you think about TechOps changes. If they're lengthy, you make choices that avoid changes in TechOps. If they're fast, you make choices that make the most sense for the product and customer.
> The price gouging services make sense if they avoid you having to hire additional employees.
But the fact is that it doesn't. It's another service that needs training/experience to develop and operate. Arguing about these hypothetical savings is just a veiled appeal to the sunk cost fallacy and vendor lock-in I've mentioned.
> I'm pretty sure we're at the point where it's less expensive because developers are waiting minutes rather than weeks [or more] for TechOps actions to happen (we were on-prem previously [and I ran TechOps]).
I'm not sure this scenario has been remotely realistic for the past decade or so, especially since the advent of containerization. Even in bare-metal deployments, anyone can get multiple databases configured and running in a matter of minutes.
Containers don’t get you more host machines racked or more disk shelves added to the SAN. On-prem, it solves configuration within a (nearly) fixed scale which, if that’s your only problem, is great. If you’re in your own DC/colo, there’s more advantages to moving out than containers can provide alone.
I literally can’t afford to invest to the level that AWS can to run operations. AWS bandwidth is incredibly expensive right up until the point where you or a neighbor is getting a DDoS attack that Amazon just “handles” for you. My customers don’t care where we’re hosted and won’t pay extra for either on-prem or cloud-hosted. They just want it to be up and transparent. For us, AWS is cheaper/faster all-in. That’s not true for everyone and, if it’s not, please don’t use it.
> Containers don’t get you more host machines racked or more disk shelves added to the SAN.
That's immaterial to the discussion, and reads like a non-sequitur. You want a service. You deploy the service in your infrastructure. If necessary, you scale your infrastructure to meet demand. That's it. If you want to spin up a database instance, just do it. With containerization that takes between minutes and seconds.
And to drive the point home, in case you're not aware, AWS is not the only cloud provider that offers horizontal autoscaling. Some small providers even sell it out of the box, both through their Kubernetes offerings and/or through their own APIs.
Also, the sales brochure for managed services mentions scalability and reliability, and in the case of AWS also global deployments, but the truth of the matter is that it costs a hefty premium and in most cases it's totally irrelevant.
So, in practical terms, pointing to databases means close to nothing.
> I literally can’t afford to invest to the level that AWS can to run operations.
And that's perfectly fine, because a) AWS really is not foolproof (see the latest outage of AWS's US-WEST-2 region, which might have single-handedly dropped AWS's reliability to only 99.5% this year), b) operating your own infra already gets you plenty of 9s easily, and c) the theoretical difference between the 9s you get and the 9s AWS advertises is more often than not totally irrelevant to the use cases you need to meet.
To sum it up, you may argue all you want about how AWS's Rolls-Royce is far superior to any other car on the market, but the truth of the matter is that the vast majority have all their needs decisively met, and even surpassed, by running any other cloud provider's Ford hatchback.
But it doesn't save much engineering time. You now need to manage those services, and the high mark-up means you need to invest effort in scaling things up/down.
Yes, it took me about a week to learn to set up a PostgreSQL high-availability cluster. But now it saves me $4,000 per MONTH for each of our 10 databases.
And if you are using EC2 instances, AWS saves very little effort compared to bare metal.
So did you send your database logs to a centralized logging system? Did you set up roles and keys to access those systems? Are your roles integrated with the rest of your permissions system? Do you have a perf dashboard where you can see the real-time usage of your DBs? Have you already rehearsed updating your database version?
The thing is, even if you learned everything needed to set up the cluster, in my experience most database systems that aren't set up by a professional DBA will sooner or later hit a configuration or maintenance snag. Once that happens, you're pretty much on your own, and for critical systems that downtime is going to cost you more than your infra does.
So you either need a DBA on retainer if you're serious about data integrity, or you pay management costs, which means your system was set up by literally the most expert people on the planet in the area.
If you're running a cluster where performance is the highest priority and downtime and maintenance isn't a huge issue because you have a nice decent maintenance window and enough dev cycles to spend on staying up to date, for sure, go for it.
But in my experience, if you care more about a system staying up, good managed infra is so much more reliable that it's not even a question.
> So did you send your database logs to a centralized logging system?
Not needed.
> Did you set up roles and keys to access those systems?
Yes.
> Are your roles integrated with the rest of your permissions system?
Not needed. Access to the DB is limited to a very small circle, by design. Postgresql's mechanisms suffice.
> Do you have a perf dashboard where you can see the real-time usage of your DBs?
htop/task manager and Postgresql's internal monitoring have so far been sufficient.
> Have you already rehearsed updating your database version?
Indirectly yes, although AFAIK (happy to learn otherwise) major version updates are also not supported by RDS.
This has been working for years now, with actual paying customers (so "in production"). It's always about opportunity cost and diminishing returns: RDS costs about as much as 3 developers. The benefit the product gets from 3 more developers far outweighs the benefit of another "9" in the SLA.
Even if all of the savings would have gone to pay for a DBA, we'd still get a better deal: an in-house DBA will focus on addressing our specific needs. RDS also has a serious opportunity cost since it prohibits using extensions that are very useful for us, forcing us to spend time on workarounds.
And frankly, it's not like RDS is even that reliable. We had multiple incidents where the database stopped responding or slowed to a crawl for several hours, and we had no way to debug this (since we cannot access the server itself). They also don't seem to do anything about PostgreSQL tuning.
For the customer, the app slowing down (because RDS has a very limited IO bandwidth) is an outage (and is much more frequent than some system error).
tl;dr: Many people can meet their needs with a Ford, and living on the streets to pay for the extra features of a Ferrari is not a good tradeoff for them.
That must be some kind of “trick of the trade”, because Firebase originally had a feature for hard-capping the bill, and they even had videos explaining how to use it, but it was removed later on. Now they suggest building a script that monitors your bill and nukes the project if something happens. The catch? Billing is not real time.
I was recently forced to migrate my hobby FOSS project the other way: from DigitalOcean to AWS. The primary reason being: a generous quota of 60,000 emails per month to send via SES. Most mail providers give only up to 3,000 to 6,000 emails per month.
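(For context, the send itself is only a few lines with boto3; the addresses below are placeholders, and the sending identity has to be verified in SES first.)

    import boto3

    ses = boto3.client("ses", region_name="us-east-1")

    ses.send_email(
        Source="noreply@example.org",  # must be a verified SES identity
        Destination={"ToAddresses": ["user@example.com"]},
        Message={
            "Subject": {"Data": "Hello from SES"},
            "Body": {"Text": {"Data": "Plain-text body goes here."}},
        },
    )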
It's not that hard or far-fetched, given that your EC2 instance needs to have permissions to call SES and AWS can tell whether these requests come from inside or outside their network.
Nothing, it's just another layer you have to manage yourself (sidenote: the free tier EC2 instance is free for 12 months since the AWS account creation).
You'd also need to make communication secure, and use some authorization mechanism to avoid abuse and email addresses leaks (or other private information you might be passing to your service to include in the email).
> Nothing, it's just another layer you have to manage yourself
You already have to manage that layer. From your description, the only choice that was really on the table was whether AWS SES was a good enough reason to port an entire system to AWS, or whether it was reasonable to just deploy the email-firing service to AWS.
> You'd also need to make communication secure, and use some authorization mechanism to avoid abuse and email addresses leaks (or other private information you might be passing to your service to include in the email).
You still need to do that even if the service is running on AWS. Running the SES client in a VPC is also a very poorly researched excuse, as you can put up a PrivateLink connection to get the same effect.
All in all, it's hard to believe that someone thought it was a good idea to port an entire system to a different cloud provider just because they wanted to send emails.
> I mean, there are consultancies whose entire premise is expertise on AWS billing, so the chance of AWS newbie-me running up many thousands because I forgot to switch off service A or had the wrong setting for service B is non-zero.
That line of reasoning is wrong. I'm sure there are consultancies that specialize in office stationery procurement; doesn't mean anything for your small use case of buying a few pens for your home office.
There's no chance I accidentally buy $10,000 worth of staplers when I walk into Office Depot though, while the opposite is extremely easy in AWS. Plus, when I checkout I'll get an itemized total of what I owe before I pay, I won't be charged an unclear amount at a future date.
It's the same with Azure. On multiple occasions, I had databases created in tiers several times more expensive than the one I use with my subscription. This wasn't a manual mistake; a sleeping app got awakened (maybe I hit the run button by mistake) and it ended up creating the database via the ORM framework. Since the ORM framework only executes "create database" on the SQL server, Azure goes with its default tier, which is something like $250 a month.
I've set up the budget alerts on Azure, but these are threshold-based (% of budget consumed) and they come every month, so technically they aren't alerts so much as information that still requires you to do the math on whether you're within budget. So you tend to ignore them after a while. Recently, we decided to build a simple solution ourselves [0] which alerts based on budgeted pace rather than consumption.
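The pace check itself is trivial arithmetic; a sketch (the numbers are made up):

    from datetime import date
    import calendar

    def over_pace(month_to_date_spend, monthly_budget, today):
        # True if spend so far is ahead of a straight-line daily budget pace.
        days_in_month = calendar.monthrange(today.year, today.month)[1]
        budgeted_pace = monthly_budget * today.day / days_in_month
        return month_to_date_spend > budgeted_pace

    # $180 spent by the 12th of a 30-day month against a $300 budget:
    # straight-line pace is $300 * 12/30 = $120, so this alerts.
    print(over_pace(180.0, 300.0, date(2021, 9, 12)))  # True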
DO does not have a good support channel. My droplet died. I couldn't reach it, and neither could the world. There was no emergency button for getting help. Just send in an email. After a small time window, I had to just delete and rebuild my droplet. Two days or so later, they got back to me and because I deleted it, they couldn't debug it. Two days to get back to me on a dead, non-reachable droplet. I don't think that is acceptable for anything running in production.
But the whole point of treating cloud servers as cattle rather than pets is that when one dies, you can spin up another one, right? Ironically, that's one reason I prefer AWS over DO and the like, because AWS's EC2 auto-scaling and availability zones are great for this kind of resilience.
Who told you to call their customer support for a refund? AWS (and other cloud vendors) practically never refund. They will give out credits for their platform but that won't help you much as a private individual who just lost hundreds of dollars.
> Who told you to call their customer support for a refund? AWS (and other cloud vendors) practically never refund.
I've gotten refunds about a half a dozen times now. Every time I've asked. One was for over a hundred thousand dollars.
I've never paid for support, I have no contacts within the company (like are often necessary at places like Google), I literally just put in a support ticket asking for a refund and got a refund.
They usually require some documentation/explanation of how you're going to avoid making the same mistake again (which is fair), but otherwise have been very cooperative.
That’s kind of a meme on HN and Reddit: there were a few public occasions where users were refunded, and people now just assume AWS will refund every time.
I've never had a refund request rejected myself, and I've made multiple mistakes over multiple accounts. Even things such as "Hey, I forgot to turn off this ec2, i wanted to destroy it, any chance for a refund?"
I would avoid DO and similar McDonald's-type cloud providers for anything production. Commercial or private.
AWS might make it difficult to figure out the cost (the most common complaint), but the services are professional grade and their support is as well. DO, on the other hand, provided me an instance with an IP that was on a public blacklist, then banned my account within 5 minutes of spawning an instance with the explanation that "it was compromised and hacking", failing to accept that they were the ones who provided the OS image and the public IP. Took me two months of arguing to get the account unblocked and my balance withdrawn. Lesson learned; you get what you pay for. Back to AWS for me.
I nearly made myself a very nice footgun not long ago.
So: MediaConvert (video transcoding), direct upload to an S3 bucket, the bucket fires an event to my application, and my application builds the job and submits it to MediaConvert with the output bucket as the destination.
Straightforward enough, unless you happen to be copying a config while tired and set your input and output buckets to the same bucket...
Fortunately, previous-me was paranoid enough to have put in an if check that dies if they are the same, but otherwise that could have cost a lot of money.
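The check itself is tiny; something along these lines (names are placeholders):

    def build_job(input_bucket, output_bucket):
        # Guard against the copy-paste footgun: writing MediaConvert output
        # back into the input bucket re-fires the S3 event and loops forever.
        if input_bucket == output_bucket:
            raise ValueError("input and output buckets must differ")
        # ... build and submit the MediaConvert job here ...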
Nothing for me compares to the time I purchased 2 reserved EC2 instances for about $5K on my personal account rather than the company's. I can still remember that sinking feeling as I realized what I'd done.
It’s incredibly easy to spend a lot of money on the cloud, indeed. I remember using Google Cloud’s translate API on a bunch of documents — it took several hours for the bill to pop up at $1500. This was a hobby / personal project of mine, Google did not refund it, because of course I should have read the pricing more carefully.
This is the advantage with AWS, they _will_ refund mistakes. I've seen it happen twice, and both times were resolved quickly with a rep on the phone.
I've also once had an issue with my own personal account. Five minutes with a rep on the phone saved not my bank account, but my website and hosted services, because my credit card was cancelled and it would be another few months before I could get another.
Amazon as a company tends to side with the customer. Their whole mantra is that it's not worth chasing after x* amount of dollars. Now repeat offenses? No, you're not getting away with using their services for free. (You mention twice, but I imagine 4 or 5 times, and they're going to fault you without escalating the issue)
*within reason... you're not going to serve up an app all month long and skip out on a million dollar bill.
In summary: Either overprovisioning, or not realising every extra CPU cycle or I/O operation costs extra money.
This is, of course, the real way "the cloud" makes money. Carefully tuned, it can no doubt be cheaper than do-it-yourself; however, it is also quite easy to rack up a lot of costs.
Contrary to popular belief, the case for going to the cloud isn't cost saving; it's flexibility and value for money. It'll likely cost you around the same as running it in a DC, but there you won't have features like auto scaling, increased security, and much more.
About the same? Last I checked, for our somewhat static workload on a bunch of webservers, AWS would be 10x the price. Not to mention that you need someone with deep AWS knowledge and experience to manage your system, just like you need someone to manage your dedicated servers in a DC.
It's great for workloads that fluctuate extremely, or require massive scaling in very short time. Not sure about the increased security. If you run your images on EC2, it's still up to you to not mess up the config.
We're actively planning an AWS workload right now, and with reserved instances for the baseline workload, the pricing is closer to 1.5-2x, but the cost savings of only needing to scale up for a couple of hours per week make up for that. Yes, it would be cheaper to run our own infra for the base load and burst into AWS, but that adds operational load onto the development team, which defeats the purpose of going with AWS in the first place.
It’s also super easy to use. I have an instance hosting a game server for friends. I might be wasting money since the server sits idle about half the time though.
There was a coupon site that I moved to AWS because all the customers were in the EU, but our servers were in Los Angeles. I implemented it on two t1.micro servers, db + nginx. Every night at 10 PM LA time during the first week, I'd get a slew of notifications from monitoring that the CPUs were pegged. I think I put Squid or something in front of the nginx instance so that the most common stuff came out of the proxy instead of having to be rendered by PHP every time.
However another thing that could have been done is to use a slightly larger instance for the DB to get more ram for cache, and then use a single front end until 9PM, spin another up, and spin one down at like 1AM.
This is obviously a trivial example, but lots of sites and apps get everyone refreshing at once maybe once or twice a day, and that's when scaling helps, at least from a "costs $X per hour, so don't use it unless we need it" standpoint. Loads averaged under 20% for the other 23 hours of the day.
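Scheduled scaling actions cover that pattern directly; a rough sketch with boto3 (the group name and UTC cron expressions are placeholders):

    import boto3

    asg = boto3.client("autoscaling", region_name="eu-west-1")
    GROUP = "frontend-asg"  # placeholder auto scaling group name

    # Add a second front end shortly before the daily spike...
    asg.put_scheduled_update_group_action(
        AutoScalingGroupName=GROUP,
        ScheduledActionName="evening-scale-out",
        Recurrence="0 4 * * *",   # cron, in UTC (~9 PM in LA)
        DesiredCapacity=2,
    )

    # ...and drop back down once it dies off.
    asg.put_scheduled_update_group_action(
        AutoScalingGroupName=GROUP,
        ScheduledActionName="night-scale-in",
        Recurrence="0 9 * * *",   # cron, in UTC (~1-2 AM in LA)
        DesiredCapacity=1,
    )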
From the perspective of a developer, flexibility is a double-edged sword.
Before cloud: we have database quota of a few gigabytes, and once in a few years we need to justify to management why the quota should be doubled.
After cloud: whenever we add a new table, or a new column, or import lots of data, the invoice slightly increases, the management notices, and we need to justify the extra megabytes.
My favorite billing mistake was forgetting to delete an unused elastic IP address and then realizing I was being charged $34 / month for 2 months just to have it exist while doing nothing.
Edit: It's exactly $33.62 and I was mistaken on what caused it. It came from having a NAT Gateway just idling which is $0.045 per hour x 747 hours = $33.62 on us-east-1.
I know it's not the biggest mistake ever, but these things creep up on you when you use CloudFormation and it continuously fails to delete resources so you're left having to manually trace through a bunch of resources. It's easy to leave things hanging.
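A small audit script catches both of the usual leftovers here; a rough sketch (the region is a placeholder, and "unused" is judged purely by association state):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Elastic IPs that are not associated with anything still bill hourly.
    for addr in ec2.describe_addresses()["Addresses"]:
        if "AssociationId" not in addr:
            print("unattached EIP:", addr["PublicIp"])

    # NAT gateways bill per hour whether or not any traffic flows through.
    for nat in ec2.describe_nat_gateways()["NatGateways"]:
        if nat["State"] == "available":
            print("NAT gateway running:", nat["NatGatewayId"], "in", nat["VpcId"])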
unused Elastic IP pricing looks to me like $3.60/month on their pricing page. ($0.005 per hour). What am I missing to get to $34/month? (Or did you have 10 of em?)
1) Terminating instances that had ephemeral disks with stuff you needed while thinking the EBS volumes would remain
2) Leaving NAT gateways lying around or ELBs that do nothing and have no instances attached.
3) Public S3 buckets - arguably the most common one that can lead to security incidents
4) Debugging security groups/Network ACLs and straight up breaking networking for something without knowing it. The reverse of that is wanting to fix something quickly, opening 0.0.0.0/0 to everyone, and never getting around to tightening up the firewall later on (a quick scan for that wide-open case is sketched below).
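Here's what that scan can look like (the region is a placeholder; it only checks IPv4 ingress ranges):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Flag any security group rule that lets the whole internet in.
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for perm in sg.get("IpPermissions", []):
            for ip_range in perm.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    print(sg["GroupId"], sg["GroupName"], "allows",
                          perm.get("IpProtocol"), "from anywhere")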
I was playing with the Azure "free" tier. Even though I tried to be extremely careful with it, after a while I noticed I had left a storage blob for a VM hanging around, plus some external IPv4 address. I will continue to use Hetzner Online for my own stuff instead of running it on "public cloud".
Does the 'cpu credits' stuff apply to spot instances too? I have been thinking of shortening my animation render time with spot instances, but it only makes sense if I can run every core at 100% for the entire life of the instance.
I view AWS as a study in doing everything the "bare hands" way. Here are some examples of the old sysadmin ways of doing things vs the modern "web" way:
* regions -> self-balancing algorithms like RAFT
* roles/permissions -> tokens
* IP address filtering -> tokens
* CPU clusters -> multicore/containerization/Actor model
* S3 -> IPFS or similar content-addressable filesystems
It's not just AWS having to deal with this stuff either:
* CORS -> Subresource Integrity (SRI)
* server languages (CGI) -> Server-Side Includes (SSI)
* Javascript -> functional reactive, declarative and data-driven components within static HTML
* async -> sandbox processes, fork/join, auto-parallelization (seen mostly in vector languages but extendable to higher-level functions)
* CSS -> a formal inheritance spec (analogous to knowing set theory vs working around SQL errata)
I could go on forever but I'll stop there. We are living at a very interesting time in the evolution of the web. I think that web dev has reached the point where desktop dev was in the mid-1990s and is ripe for disruption. No disruption will come from the big companies though, so this is your chance to do it from your parents' basement!
OK, I'm going to admit to a mistake revolving around NAT gateways and Lambdas.
So, I basically wanted to connect a Lambda to a Postgres/RDS database. For that I had to put it into a private VPC, but the Lambdas still had to talk to the world (a lot), so I just put a NAT gateway in front of it, no biggie.
Well, end of story: one day I produced 2,000 euros in costs for the NAT gateway, haha.
This is literally the main reason a lot of companies use AWS. In Australia, it is very hard for Government Departments to get capital expenses approved for infrastructure as it requires a lot of rigmarole.
However, once you’re in AWS it’s OpEx, and who cares as long as you don’t break the budget too soon before EOFY.
So basically this ... «Ah, I see you have the machine that goes ping. This is my favorite. You see we lease it back from the company we sold it to and that way it comes under the monthly current budget and not the capital account.»
Are they really though? A serverless event driven architecture system I’m working on literally costs less than £10 a month on Azure. Running full blown VMs instead of cheaper more appropriate technologies like containers or functions will always cost more.
As per our calculation for CloudAlarm [0], as we reach a few hundred users, it'd be cheaper to use a dedicated instance than a serverless (Azure Functions) design. So it may vary from system to system depending on the amount of work you perform for each user.
0: https://cloudalarm.in/ – btw, you may wish to have daily budgeted pace based alerts using it – to inform you when the usage spikes up (much faster than Azure's consumption threshold based alerts).
But what mistakes did he make?
Did he screw up the bill? Did he fail to keep services available? I only read facts about the ins and outs of AWS' billing and credits system.
If you run out of CPU or IOPS burst balance, your system will suddenly slow to a crawl, and it can easily cause downtime or, in the case of background jobs, long queues or never-ending jobs. Learned that the hard way, a couple of times.
One time I optimized DB access which fixed the IOPS usage, and then that caused more CPU usage on the app servers which caused them to run out of CPU burst... Fun times. Switched from one burst issue to another.
I disagree this is about scaling systems. This is actually more about using the wrong instance type. If this was on a typical VPS it wouldn't have ever happened. The baseline CPU level on these burst instances is so low that for any long running task using even like 40% CPU it gets throttled so hard it brings everything down. I would have been totally fine if this was a $5 DigitalOcean VPS
Burst CPU and IOPS has bitten me a couple times over the years. In fact, it’s basically the sole cause of nearly all our downtime in recent history. That’s frustrating. I get that it’s a technical solution to the problem of resource utilization at scale, but they could’ve spent some time making it easier to observe — for example, rescale the CPU or IOPS graphs so that 100% is your max sustained budget, and anything over 100% eats into your quota.
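In the meantime you can at least alarm on the credit metrics directly instead of on raw utilization; a sketch (names, thresholds, and resource ids are placeholders):

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    # Warn when a t-family instance is burning through its CPU credits.
    cw.put_metric_alarm(
        AlarmName="low-cpu-credit-balance",
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=50,
        ComparisonOperator="LessThanThreshold",
    )

    # Same idea for gp2 volume IOPS burst; BurstBalance is a percentage.
    cw.put_metric_alarm(
        AlarmName="low-ebs-burst-balance",
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=20,
        ComparisonOperator="LessThanThreshold",
    )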
Slightly OT: I love Forge but recently I've started using it for my non-PHP projects which feels... wrong. Are there any similar services that are more agnostic?
On billing.. they will never do it, but on smaller accounts they could build trust by offering some sort of "prepaid" mode like cell phone services do at the low end.
That is - you deposit $X in your account, and AWS nukes your live services if you breach it. The worst that ever happens is you are out sunk cost of the $X you had already deposited.
"Technically they are a smidgen slower than Intel for certain workloads."
In my experience, after migrating several servers with quite varying workloads, they're faster than Intel - and more than a smidgen. Just as is the general case with current AMD Ryzen vs Intel.
[Disclosure] I'm Co-Founder and CEO of http://vantage.sh/, a cloud cost platform for AWS. Previously I was a product manager at AWS and DigitalOcean.
Since the author and so many people are commenting about AWS costs (and in particular, choosing cheaper EC2 instances and EBS volumes), I thought I'd mention that Vantage has recommendations that look for these exact things and tell you about them, so you don't get tripped up or spend more than you have to.
If you have "antiquated" EC2 instances or EBS volumes, Vantage will give you a recommendation for which instance to switch to and how much money you'll save.
The first $2,500/month in AWS costs are also tracked for free so people get a lot of value out of the free tier and can save significant parts of their bills when developing on AWS.
Respect that you are all about that grindset for your product in this thread, but it's also a little insane that you need a third party tool to make sense of what's going on in AWS.
I'm a bit of a GCP fan, and while its billing is also arcane, I think it is just a little bit easier to understand and better laid out. For bread-and-butter stuff like regular VPSs though, AWS is often a little cheaper. But GCP's other cloud offerings are occasionally very respectably priced.
Every $X00-billion business is big enough that third-party tools will always be desired, because the default experience won't be good enough for some part of the market.
question is whether or not that part is big enough to warrant its own venture scale business, as with Vantage :)
On a price-sensitive project I almost exclusively used spot instances at a dramatically reduced price over on-demand. It forced me to build high-availability elements into the design at the outset, though ultimately spot instances got shut down no less frequently than in my experience with on-demand maintenance and individual machine outages.
Obviously mileage will vary, but going in I was under the impression that spot instances were on the knife's edge, when with a decent pricing strategy they're as robust as on demand at a fraction of the cost.
Spot instances for GPU are shutdown within hours.
As frequently touted in favor of AWS, engineer time is the most important thing. The time to adapt the code to frequent failure, and the delays in getting the results, costs money as well, negating the financial saving from spot instances.
Designing to remain robust in the face of failures is compulsory for any project of any significance. Or at least it should be, though a lot of projects go on a wing and a prayer that nothing will go awry and "save" those engineering hours until a catastrophe at some future point. It basically just prioritized what already should be a priority.
I have no doubt that fringe/niche instances have more competitive spot behaviors, and how you set your bid range dramatically impacts how you survive competition, but I had vanilla instances last for literal years (note that by default the spot request has a lifespan of one year, so you have to modify that) at per-hour pricing somewhere in the range of 1/5th of on-demand.
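For reference, a persistent spot request with an extended lifespan looks roughly like this (the max price, AMI, key name, and date are all placeholders):

    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ec2.request_spot_instances(
        SpotPrice="0.05",               # max price, well under on-demand
        InstanceCount=1,
        Type="persistent",              # re-fulfilled after interruptions
        ValidUntil=datetime.datetime(2025, 1, 1),  # extend past the default expiry
        InstanceInterruptionBehavior="stop",
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",
            "InstanceType": "m5.large",
            "KeyName": "my-keypair",
        },
    )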
But mileage will vary.
I don't use those spot instances anymore as my projects are much better financed now, and I have significant compute on other platforms including bare metal in colocation facilities. However when I did I stayed silent about it, feeling almost like it was a secret that would be ruined if others knew about it.
There it is - the hidden cost of AWS. For bare metal, the risk of hardware failure is so low that it's faster to just handle the interruption when it happens (e.g. just restart the process) than to implement interruption tolerance. The hardware fails only once in many years. The chances of that happening during the 24 hours we train a model are almost zero. On a spot instance, the risk of the same is almost 100%, requiring investment up front.
For the price of a spot instance we can get an always-on bare-metal server without having to worry for how long it will remain available.
We use GCP's equivalent of spot instances (preemptibles) to great effect as well. It actually works better at larger scale, since a smaller % of your machines get preempted at any given time.