The cloud is a psyop, a scam. Except at the tiniest free-tier / near-free-tier use cases, or true scale-to-zero setups.
I've helped a startup with 2.5M revenue reduce their cloud spend from close to 2M/yr to below 1M/yr. They could have reached 250k/yr renting bare-metal servers. Probably 100k/yr in colos by spending 250k once on hardware. They had the staff to do it but the CEO was too scared.
Cloud evangelism (is it advocacy now?) messed up the minds of swaths of software engineers. Suddenly costs didn't matter and scaling was the answer to poor designs. Sizing your resource requirements became a lost art, and getting into reaction mode became law.
Welcome to "move fast and get out of business", all enabled by cloud architecture blogs that recommend tight integration with vendor lock-in mechanisms.
Use the cloud to move fast, but stick to cloud-agnostic tooling so that it doesn't suck you in forever.
I've seen how much cloud vendors are willing to spend to get business. That's when you realize just how massive their margins are.
> Suddenly costs didn't matter and scaling was the answer to poor designs.
It did.
Did you know that the cloud costs less than what the internal IT team at a company would charge you?
Let's say you worked on product A for a company and needed an additional VM. Besides paperwork, the cost to you (for your cost center) would be more than using the company credit card for the cloud.
> Sizing your resource requirements became a lost art
In what way? We used to size for 2-4x since getting additional resources (from the in-house team) would take weeks to months. Same old - just the cloud edition.
> Did you know that the cloud costs less than what the internal IT team at a company would charge you?
Yes. Internal IT teams run the old-school way are inefficient. And that's what the vendor tells you while they create shadow IT inside your company. Skip ITSM and ITIL... do it the SRE way.
Until the cloud economist (a real role) comes in and finds a way to extract more rent out of their customer base (like GCP's upcoming doubling of rates on CDN Interconnect). And until internal IT kills shadow IT and regains management of cloud deployments. Cybersecurity and stuff...
Back to square one. ITIL with cloud deployments. Some use cases will be way cheaper... but for your 100s of PBs of enterprise data, that's another story. And data gravity will kill many initiatives just based on bit movement costs.
> Besides paperwork, the cost to you (for your cost center) would be more than using the company credit card for the cloud.
To some extent. One is hard dollars the other is funny money. But I thought paying for cloud with the company credit card was a 2016 thing. Now it's paid through your internal IT cost center, with internal IT markup.
I've seen petabytes of data move to the cloud and then we couldn't perform some queries on it anymore as that store wouldn't support it, and we'd need to spend 7 figures to move to another cloud database to query it. And that's hard dollars.
Yes, during early cloud days it was lean and aimed at startups. Now it's aimed at enterprise, and for some reason lots of startups still think it's optimized for them. It's not and it hasn't been for a long time.
> Yes. Internal IT teams run the old-school way are inefficient.
They aren't. It's politics. They want to protect and improve their own headcount and resources.
> One is hard dollars the other is funny money.
All the same to a team / department. It's not like people run it like their own wallet.
> finds a way to extract more rent out of their customer base
I find you just have a grudge against the cloud, and hence are too young. For every example you have, the so-called "internal" IT team can and will do just the same. Go back to the 90s and 00s - it was the same. The infra team wanted some fancy new storage arrays and charged everyone 2x for the new service, etc.
> and for some reason lots of startups still think it's optimized for them. It's not and it hasn't been for a long time.
The problem isn't the cloud. Startups have always worked like this, even 10-20 years ago. It's about wastage. They can raise and grow faster - so they think. The problem, if any, is that money isn't as cheap recently. Nothing new.
> the so-called "internal" IT team can and will do just the same.
but how is shadow IT gonna solve anything? it'll get kudos from the junior VP, chuckles from the SVP, but the CIO will laugh you out of the room at how poor you are at getting shit done internally.
> The problem isn't the cloud.
The problem is how gullible folks are about cloud advocacy, or any vendor advocacy in general. It's all lies, but cloud lies are better than others! Your 3-year commitment won't scale down to low figures. Oh, you want that many nodes come Black Friday? Gotta reserve! Yup, infinite scale actually means infinite lies.
Above all, the cloud is not cheap. 11B profit on 33B revenue per quarter at AWS. If your local IT spend is inefficient, I bet it won't be more efficient in the cloud.
Thanks. I unsubscribed when I busted my weekly limit in a few hours on the Max 20x plan when I had to use Opus over Sonnet. It really feels like they were off by an order of magnitude at some point when limits were introduced.
You get visibility into your usage, and you can see whether you're exceeding it. They recommend using plans if your typical traffic is 'only' up to 50TB per month. Occasional spikes are fine from what I understand.
This isn't true anymore; we are way beyond 2014 Hadoop (which is what the blog post is about) at this point.
Go try doing an aggregation of 650GB of JSON data using normal CLI tools vs duckdb or clickhouse. These tools pipeline and parallelize in a way that isn't easy to replicate with just GNU Parallel (trust me, I've tried).
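To make the "pipelining and parallelizing" point concrete, here is a toy Python sketch of the partial-aggregate-then-merge pattern these engines use internally. The data and chunk size are made up for illustration; real engines run this over compressed column chunks in native code, not Python lists.

```python
# Toy sketch of the split / partial-aggregate / merge pattern that engines
# like duckdb and clickhouse apply internally (data here is synthetic).
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Pretend each "chunk" is a slice of newline-delimited JSON records.
records = [json.dumps({"date": f"2024-01-{d:02d}"}) for d in range(1, 11)] * 100
chunks = [records[i:i + 250] for i in range(0, len(records), 250)]

def aggregate_chunk(chunk):
    # Partial aggregation: each worker builds its own small hash table.
    return Counter(json.loads(line)["date"] for line in chunk)

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(aggregate_chunk, chunks))

# Merge step: combine the per-chunk partial counts into the final result.
totals = Counter()
for partial in partials:
    totals += partial

print(totals["2024-01-01"])  # each date appears 100 times in the toy data
```

The hard part with plain GNU Parallel is exactly that merge step: the per-chunk partial aggregates have to be combined somewhere, which the engines do for you.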
What if it was 650TB? This article is obviously a microbenchmark. I work with much larger datasets, and neither awk nor DuckDB would make a difference to the overall architecture. You need a data catalog, and you need clusters of jobs at scale, regardless of the data format library, or libraries.
1. Assume a date is 8 bytes
2. Assume 64-bit counters
So for each date in the dataset we need 16 bytes to accumulate the result.
That's ~180 years worth of daily post counts per MB of RAM - but the dataset in the post was just 1 year.
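The arithmetic above can be checked in a few lines, assuming the 8-byte date and 8-byte counter from the list:

```python
# Back-of-the-envelope check of the accumulator size
# (assumptions: 8-byte date key + 8-byte counter per distinct date).
DATE_BYTES = 8
COUNTER_BYTES = 8
bytes_per_date = DATE_BYTES + COUNTER_BYTES    # 16 bytes per entry
dates_per_mb = (1 << 20) // bytes_per_date     # 65,536 entries per MB
years_per_mb = dates_per_mb / 365.25           # ~179 years of daily buckets
print(bytes_per_date, dates_per_mb, round(years_per_mb))
```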
This problem should be mostly network-limited in the OP's context; decompressing snappy-compressed Parquet should run at circa 1GB/sec. The "work" of parsing a string into a date and accumulating isn't expensive compared to snappy decompression.
I don't have a handle on the 33% longer runtime difference between duckdb and polars here.
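As a back-of-the-envelope check (using the ~1GB/sec decompression figure above and the article's 650GB dataset), decompression alone puts a floor of roughly eleven minutes on the job:

```python
# Rough lower bound on runtime if decompression alone runs at ~1 GB/s
# (both figures are taken from the comments above, not measured here).
dataset_gb = 650
decompress_gb_per_s = 1.0
minutes = dataset_gb / decompress_gb_per_s / 60
print(round(minutes, 1))  # ~10.8 minutes
```

That floor is in the same ballpark as the ~16 minute runtimes discussed in the thread, which supports the view that the query is I/O-bound rather than compute-bound.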
I think the entire point of the article (reading forward a bit through the linked redshift files posts) is that almost nobody in the world uses datasets bigger than 100TB, that when they do, they use a small subset anyway, and that 650GB is a pretty reasonable approximation of the entire dataset most companies are even working with. Certainly in my experience as a data engineer, they're not often in the many terabytes. It's good to know that OOTB duckdb can replace snowflake et al. in these situations, especially with how expensive they are.
> It's good to know that OOTB duckdb can replace snowflake et al. in these situations, especially with how expensive they are.
Does this article demonstrate that though? I get, and agree, that a lot of people are using "big data" tools for datasets that are way too small to require it. But this article consists of exactly one very simple aggregation query. And even then it takes 16m to run (in the best case). As others have mentioned the long execution time is almost certainly dominated by IO because of limited network bandwidth, but network bandwidth is one of the resources you get more of in a distributed computing environment.
But my bigger issue is just that real analytical queries are often quite a bit more complicated than a simple count by timestamp. As soon as you start adding non-trivial compute to the query, or multiple joins (and g*d forbid you have a nested-loop join in there somewhere), or sorting, the single-node execution time is going to explode.
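A toy illustration of why the join strategy matters so much on a single node: a nested-loop join compares every pair of rows (O(n*m)), while a hash join builds a table on one side and probes it (O(n+m)). The data here is synthetic.

```python
# Synthetic keyed rows: each of 50 keys appears 10 times on each side.
left = [(i % 50, f"l{i}") for i in range(500)]
right = [(i % 50, f"r{i}") for i in range(500)]

def nested_loop_join(l, r):
    # Compares every pair: 250,000 comparisons for 500 x 500 rows.
    return sorted((lk, lv, rv) for lk, lv in l for rk, rv in r if lk == rk)

def hash_join(l, r):
    # Build a hash table on one side, then probe it with the other side.
    table = {}
    for rk, rv in r:
        table.setdefault(rk, []).append(rv)
    return sorted((lk, lv, rv) for lk, lv in l for rv in table.get(lk, ()))

result = hash_join(left, right)
assert nested_loop_join(left, right) == result  # same output, very different cost
print(len(result))  # 500 left rows x 10 matches each = 5000 pairs
```

At 500 rows per side the difference is invisible; at millions of rows the quadratic plan is what makes single-node execution time explode.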
I completely agree; real-world queries are complicated joins, aggregations, staged intermediary datasets, and further manipulations. Even if you start with a single coherent 650GB dataset, if you have a downstream product based on it, you will have multiple copies and iterations, which also have to be reproducible, tracked in source control, and visualized in other tools in real time. Honestly, yes, parquet and duckdb make all this easier than awk. But they still need to be integrated into a larger system.
I've seen how the cloud changed over 20 years from s3/ec2 in 2006 to what we have today. I've also seen how it's built at AWS. It's ironic they call it utility computing.
What I always feared as a user was that they'd invent a new billable metric, which happened a few times. Have you ever seen a utility add them at this pace? The length of your monthly usage report shows all those items at $0 that could eventually be charged. Let that sink in.
Another interesting element is that all the higher-level services are built on core services such as s3/ec2. So the vendor lock-in comes from all the propaganda that cloud advocates have conditioned young developers with.
Notice how core utilities in many countries are state monopolies. If you want the cloud to be a true utility, perhaps that's the way to get it started. The state doesn't need a huge profit, but it does need sovereignty and to keep foreign agents out of its DCs. Is it inefficient? Of course. But if all you really need is s3/ec2 and some networking / queuing constructs, perhaps private companies can own the higher-tier / lock-in services while guaranteeing they run on such a utility. That would give their users reduced egress fees from a true utility, which doesn't need (and wouldn't be allowed to have) a 50x markup on that line item.