Reclaim the Stack (reclaim-the-stack.com)
496 points by dustedcodes 79 days ago | 321 comments



I’ve been building and deploying thousands of stacks, first on Docker, then Mesos, then Swarm, and now k8s. If I have learned one thing from it, it’s this: it’s all about the second day.

There are so many tools that make it easy to build and deploy apps to your servers (with or without containers) and all of them showcase how easy it is to go from a cloud account to a fully deployed app.

While their claims are true, what they don’t talk about is how to maintain the stack after “reclaiming” it. Version changes, breaking changes, dependency changes and missing dependencies, disaster recovery plans, backups and restores, and major shifts in requirements all add up to a large portion of your time.

If you have that kind of team, budget or problem that deserves those, then more power to you.


> If you have that kind of team, budget or problem that deserves those, then more power to you.

This is the operative issue, and it drives me crazy. Companies that can afford to deploy thousands of services in the cloud definitely have the resources to develop in-house talent for hosting all of that on-prem, and saving millions per year. However, middle management in the Fortune 500 has been indoctrinated by the religion that you take your advice from consultants and push everything to third parties so that 1) you build your "kingdom" with terribly wasteful budget, and 2) you can never be blamed if something goes wrong.

As a perfect example, in my Fortune 250, we have created a whole new department to figure out what we can do with AI. Rather than spend any effort to develop in-house expertise with a new technology that MANY of us recognize could revolutionize our engineering workflow... we're buying Palantir's GenAI product, and using it to... optimize plant safety. Whatever you know about AI, it's fundamentally based on statistics, and I simply can't imagine a worse application than trying to find patterns in data that BY DEFINITION is all outliers. I literally can't even.

You smack your forehead, and wonder why the people at the top, making millions in TC, can't understand such basic things, but after years of seeing these kinds of short-sighted, wasteful, foolish decisions, you begin to understand that improving the company's abilities, and making it competitive for the future is not the point. What is the point "is an exercise left to the reader."


> we have created a whole new department to figure out what we can do with AI.

Wow, this is literally the solution in search of a problem.


This is absolutely true. I can easily count some 20+ components already.

So this is not a walk in the park for two willing developers learning k8s.

The underlying apps (Redis, ES) will have version upgrades.

Their respective operators themselves would have version upgrades.

Essential networking fabric (Calico, Flannel and such) would have upgrades.

The underlying kubernetes itself would have version upgrades.

The Talos Linux itself might need upgrades.

Of all the above, any single upgrade might lead to the infamous controller crash loop, where a pod starts and dies with little to no indication as to why. And not just any ordinary pod, but a crucial pod that's part of some operator supposed to do the housekeeping for you.

k8s was invented at Google and is more suited to a ZIRP world where money is cheap and, to change the logo, you have seven designers on payroll discussing for eight months how nine different tones of brand coloring might convey ten different subliminal messages.


> The underlying apps (Redis, ES) will have version upgrades.

You would have to deal with those with or without k8s. I would argue that dealing with them without it is much more painful.

> Their respective operators themselves would have version upgrades.

> Essential networking fabric (Calico, Flannel and such) would have upgrades.

> The underlying kubernetes itself would have version upgrades.

> The Talos Linux itself might need upgrades.

How is this different from regular system upgrades you would have to do without k8s?

K8s does add layers on top that you also have to manage, but it solves a bunch of problems in return that you would have to solve by yourself one way or another.

That essential networking fabric gives you a service mesh for free, that allows you to easily deploy, scale, load balance and manage traffic across your entire infrastructure. Building that yourself would take many person-hours and large teams to maintain, whereas k8s allows you to run this with a fraction of the effort and much smaller teams in comparison.

Oh, you don't need any of that? Great. But I would wager you'll find that the hodge podge solution you build and have to maintain years from now will take much more of your time and effort than if you had chosen an industry standard. By that point just switching would be a monumental effort.

> Of all the above, any single upgrade might lead to the infamous controller crash loop where a pod starts and dies with little to no indication as to why.

Failures and bugs are inevitable. Have you ever had to deal with a Linux kernel bug?

The modern stack is complex enough as it is, and while I'm not vouching for increasing it, if those additional components solve major problems for me, and they become an industry standard, then it would be foolish to go against the grain and reinvent each component once I have a need for it.


You seem to be misunderstanding. The components that add complexity in this case do not come from running a k8s cluster. They come from the Reclaim the Stack software.


Alright. So let's discuss how much time and effort it would take to build and maintain a Heroku replacement without k8s then.

Besides, GP's criticisms were squarely directed at k8s. For any non-trivial workloads, you will likely use operators and networking plugins. Any of these can have bugs, and will add complexity to the system. My point is that if you find any of those features valuable, then the overall cost would be much less than the alternatives.


The alternative is not to build a different PaaS alternative, but to simply pay Heroku/AWS/Google/Indie PaaS providers and go back to making your core product.


Did you read the reasons they moved away from Heroku to begin with? Clearly what you mention wasn't an option for them, and they consider this project a success.


Talos is an immutable OS; upgrades are painless and roll themselves back upon failure. Same thing for K8s under Talos (the only thing Talos does is run K8s).


TIL "immutable OS", thanks.

Ages ago, I had the notion of booting from removable read-only media. At the time CD-ROM. Like gear for casting and tabulating votes. Or controllers for critical infra.

(Of course, a device's bootloader would have to be ROM too. And boot images would be signed, both digitally and manually.)

Maybe "immutable boot" and immutable OS can be complimentary. Surely someone's already explored this (obv idea). Worth pondering.


Secure Boot can do the "ROM" scenario on conventional read-write media as long as your OS is capable of maintaining a chain of trust and enforcing code signature checks before execution. The media is technically writable, but any writes would break said chain of trust on subsequent loads and so the malicious code would fail to execute.

If interacting with a remote system, a TPM can also be used to achieve the same (though if you have TPM you'd generally always have secure boot) - in this case, your OS extends TPM PCRs with the hashes of all the components in its chain of trust, and the remote system uses remote attestation to prove that you indeed booted & executed the expected code before granting you an access token which is never persisted.

In the second case, malicious code would still run but would be unable to pass that authentication step and thus be unable to communicate with the remote system. This is suitable if the machine itself is stateless and not sensitive per se, and the only security requirement is ensuring the remote system is only accessed by trusted software.


The flip side of this is the cost. Managed cloud services make it faster to get live, but then you are left paying managed service providers for years.

I’ve always been a big cloud/managed service guy, but the costs are getting astronomical and I agree the buy vs build of the stack needs a re-evaluation.


This is the balance, right? For the vast majority of web apps et al., the cloud costs are going to be cheaper than having full-time Ops people managing an OSS stack on VPS / bare metal.


And what is your take on all those things that you tried? Some experience/examples would probably benefit us.


The thing that strikes me is: okay, two "willing developers" - but they need to be actually capable, not just "willing" but "experienced and able", and that lands you at a minimum of $100k per year per engineer. That means this system has a maintenance cost of over $16K per month if you have to dedicate two engineers full-time to the maintenance, and of course they have to keep up with the dynamic nature of K8s and all its tooling just to stay in front of all of that.


Also, for only two k8s devops engineers in a 24h-available world, you’re gonna be running them ragged with 12h solo shifts or taking the risk of not staffing overnight. Considering most update and backup jobs kick off at midnight, that’s a huge risk.

If I were putting together a minimum-viable staffing for a 24x7 available cluster with SLAs on RPO and RTO, I’d be recommending much more than two engineers. I’d probably be recommending closer to five: one senior engineer and one junior for the 8-4 shift, an engineer for the 4-12 shift, another engineer for the 12-8 shift, and another junior who straddles the evening and night shifts. For major outages, this still requires on-call time from all of the engineers, and additional staffing may be necessary to offset overtime hours. Given your metric of roughly $8k an engineer, we’d be looking at a cool $40K/month in labour just to approach four or five 9s of availability.


Even worse, this feels like the goal was actually about reclaiming their resumes, not the stack. I expect these two guys to jump ship within a year, leaving the rest of the team trying to take care of an entire ecosystem they didn't build.


And you may still end up with longer downtime if SHTF than if you use a managed provider.


Agreed. Forgive a minor digression, but what OP wrote is my problem now. I'm looking for something like heroku's or fly's release command. I have an idea how to implement it in docker using swarm, but I can't figure out how to do that on k8s. I googled it some time ago, but all the answers were hacks.

Would someone be able to recommend an approach that's not a hack, for implementing a custom release command on k8s? Downtime is fine, but this one off job needs to run before the user facing pods are available.


Look at Helm charts; they have become the de facto standard for packaging/distributing/deploying/updating whole apps on Kubernetes.
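
To make it concrete: a Helm pre-install/pre-upgrade hook essentially automates the pattern below, which also works with plain manifests if you don't want Helm. Rough sketch only; the names, image and release command are all made up:

    # Run the release command as a one-off Job, wait for it, then roll the app.
    # With Helm, the same Job annotated with "helm.sh/hook": pre-install,pre-upgrade
    # gets run automatically before each release.
    kubectl apply -f - <<'EOF'
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: release-task
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: release
              image: registry.example.com/myapp:v42
              command: ["bin/rails", "db:migrate"]
    EOF

    # Block until the release command has finished, then deploy the new version.
    kubectl wait --for=condition=complete --timeout=10m job/release-task
    kubectl set image deployment/myapp web=registry.example.com/myapp:v42
    kubectl delete job/release-task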


https://leebriggs.co.uk/blog/2019/02/07/why-are-we-templatin...

Something like Jsonnet would serve one better, I think. The only part that kinda sucks is the "package management" but that's a small price to pay to avoid the YAML insanity. Helm is fine for consuming third-party packages.


Agreed, but to be fair, those are general problems you would face with any architecture. At least with mainstream stacks you get the benefit of community support, and relying on approaches that someone else has figured out. Container-based stacks also have the benefit of homogenizing your infrastructure, and giving you a common set of APIs and workflows to interact with.

K8s et al are not a silver bullet, but at this point they're highly stable and understood pieces of infrastructure. It's much more painful to deviate from this and build things from scratch, deluding yourself that your approach can be simpler. For trivial and experimental workloads that may be the case, but for anything that requires a bit more sophistication these tools end up saving you resources in the long run.


> it’s all about the second day

Tangentially, I think this applies to LLMs too.


Of course you reduced 90% of the cost. Most of these costs don't come from the software, but from the people and automation maintaining it.

With that cost reduction you also removed monitoring of the platform, people oncall to fix issues that appear, upgrades, continuous improvements, etc. Who/What is going to be doing that on this new platform and how much does that cost?

Now you need to maintain k8s, postgresql, elasticsearch, redis, secret managements, OSs, storage... These are complex systems that require people understanding how they internally work, how they scale and common pitfalls.

Who is going to upgrade kubernetes when they release a new version that has breaking changes? What happens when Elasticsearch decides to splitbrain and your search stops working? When the DB goes down or you need to set up replication? What is monitoring replication lag? Or even simple things like disks being close to full? What is acting on that?

I don't mean to say Heroku is fairly priced (I honestly have no idea) but this comparison is not apples to apples. You could have your team focused on your product before. Now you need people dedicated to work on this stuff.


Anything you don't know about managing these systems can be learned asking chatgpt :P

Whenever I see people doing something like this I remember I did the same when I was in 10 people startups and it required A LOT of work to keep all these things running (mostly because back then we didn't have all these cloud managed systems) and that time would have been better invested in the product instead of wasting time figuring out how these tools work.

I see value in this kind of work if you're at the scale of something like Dropbox and moving from S3 will greatly improve your bottom line and you have a team that knows exactly what they're doing and will be assigned the maintenance of this work. If this is being done merely from a cost-cutting perspective and you don't have the people that understand these systems, it's a recipe for disaster, and once shit is on fire the people that would be assigned to "fix" the problem will quickly disappear because the "on call schedule is insane".


> and that time would have been better invested in the product instead of wasting time figuring out how these tools work

It really depends on what you're doing. Back then a lot of non-VC startups worked better and the savings possibly helped. It also helps grow the team and have less reliance on the vendor. It's long term value.

Is it really time wasted? People often go into resume building mode and do all kinds of wacky things regardless. Perhaps this just helps scratch that itch.


Definitely fine from a personal perspective and resume building, it's just not in the best interest of the business because as soon as the person doing resume building is finished they'll jump ship. I've definitely done this myself.

But I don't see this being good from a pure business perspective.


> it's just not in the best interest of the business because as soon as the person doing resume building is finished they'll jump ship. I've definitely done this myself.

I certainly hope not everyone does so. I've seen plenty of people lean toward choices based on resume / growth / interest rather than the pure good of the business, but not leave after doing so.

> But i don't see this being good from a pure business perspective.

And a business at the end of the day is operated by its people. Sure, there are an odd few that operate in good faith, but we're not robots or AI. I doubt every decision everywhere is 100% business optimal, or that it's the only criterion.


I bailed out of one company because even though the stack seemed conceptually simple in terms of infra (there wasn't a great deal to it), the engineering more than compensated for it. The end result was the same: non-stop crisis management, non-stop firefighting, no capacity to work on anything new, just fixing old.

All by design, really, because at that point you're not part of an engineering team; you're a code monkey operating in service of growth metrics.


> ... I remember I did the same when I was in 10 people startups and it required A LOT of work to keep all these things running...

Honest question: how long ago was that? I stepped away from that ecosystem four or so years ago. Perhaps ease of use has substantially improved?


> you also removed monitoring of the platform

You don't think they have any monitoring within Kubernetes?

I imagine they have more monitoring capabilities now than they did with Heroku.


The fact that HN seems to think this is "FUD" is absolutely wild. You just talked about (some of) the tradeoffs involved in running all this stuff yourself. Obviously for some people it'll be worth it and for others not, but it's absolutely amazing that there are people who don't even seem to accept that those tradeoffs exist!


I assume you reference my comment.

The reason I think parent comment is FUD isn't because I don't acknowledge tradeoffs (they are very real).

It's because parent comment implies that people behind "reclaim the stack" didn't account for the monitoring, people's cost etc.

Obviously any reasonable person making that decision includes it into calculation. Obviously nobody sane throws entire monitoring out of the window for savings.

Accounting for all of these it can be still viable and significantly cheaper to run own infra. Especially if you operate outside of the US and you're able to eat an initial investment.


Not your comment specifically, you're one of many saying FUD.

Honestly if you accept that the comment was talking about real tradeoffs then I'm a bit baffled that you thought it was FUD. It seems like an important thing to be talking about when there's a post advocating moving away from PaaS and doing it all yourself. It's great if you already knew all about all that and didn't need to discuss it, but just stare into the abyss of the other comments and you'll see that others very much don't understand those tradeoffs at all.


Exactly. It all depends on your needs and — to be honest — the quality of your sysops engineering. You may not only need dedicated sysops, but you may incur higher incidental costs with lost productivity when your solution inevitably goes down (or just from extra dev load when things are harder to use).

That said, at least in 2016 Heroku was way overpriced for high volume sites. My startup of 10 engineers w/ 1M monthly active users saved 300k+/yr switching off Heroku. But we had Jerry. Jerry was a beast and did most of the migration work in a month, with some dead-simple AWS scaling. His solution lacked many of the features of Heroku, but it massively reduced costs for developers running full test stacks which, in turn, increased internal productivity. And did I mention it was dead simple? It's hard to overstate how valuable this was for the rest of us, who could easily grok the inner workings and know the consequences of our decisions.

Perhaps this stack will open that opportunity to less equipped startups, but I've found few open source "drop-in replacements" to be truly drop-in. And I've never found k3 to be dead simple.


Sorry, but that's just a ton of FUD. We run both private cloud and (for a few customers) AWS. Of course you have more maintenance on on-prem, but typical k8s update is maybe a few hours of work, when you know what you are doing.

Also, AWS is also complex, also requires configuration and also generates alerts in the middle of the night.

It's still a lot cheaper than a managed service.


> Of course you have more maintenance on on-prem, but typical k8s update is maybe a few hours of work, when you know what you are doing.

You just mentioned one dimension of what I described, and "when you know what you are doing" is doing a lot of the heavy lifting in your argument.

> Also, AWS is also complex, also requires configuration and also generates alerts in the middle of the night.

I'm confused. So we are in agreement there?

I feel you might be confusing my point with an on-prem vs AWS discussion, and that's not it.

This is encouraging teams to run databases / search / cache / secrets and everything on top of k8s and assuming a magic k8s operator is doing the same job as a team of humans and automation managing all those services for you.


> assuming a magic k8s operator is doing the same job as a team of humans and automation managing all those services for you.

What do you think AWS is doing behind the scenes when you run Postgres RDS? It's their own equivalent of a "K8S operator" managing it. They make bold claims about how good/reliable/fault-tolerant it is, but the truth is that you can't actually test or predict its failure modes, and it can fail and fails badly (I've had it get into a weird state where it took 24h to recover, presumably once an AWS guy finally SSH'd in and fixed it manually - I could've done the same but without having to wait 24h).


Fair, but my point is that AWS has a full team of people that built and contributed to that magic box that is managing the database. When something goes wrong, they're the first ones to know (ideally) and they have a lot of know-how on what went wrong, what the automation is doing, how to remediate issues, etc.

When you use a k8s operator you're using an off-the-shelf component with very little idea of what it's doing and how. When things go wrong, you don't have a team of experts to look into what failed and why.

The tradeoff here is obviously cost, but my point is those two levels of "automation" are not comparable.

Edit: well, when I write "you" I mean most people (me included)


> Fair, but my point is that AWS has a full team of people that built and contributed to that magic box that is managing the database.

You sure about that? I used to work at AWS, and although I wasn't on K8S in particular, I can tell you from experience that AWS is a revolving door of developers who mostly quit the instant their two-year sign-on bonus is paid out, because working there sucks ass. The ludicrous churn means there actually isn't very much buildup of institutional knowledge.


> Fair, but my point is that AWS has a full team of people that built and contributed to that magic box that is managing the database

You think so. The real answer is maybe maybe not. They could have all left and the actual maintainers now don't actually know the codebase. There's no way to know.

> When things go wrong, you don't have a team of experts to look into what failed and why.

I've been on both sides of consulting / managed services teams and each time the "expert" was worse than the junior. Sure, there's some luck and randomness but it's not as clear cut as you make it.

> and they have a lot of know-how on what went wrong, what the automation is doing, how to remediate issues, etc.

And to continue on the above I've also worked at SaaS/IaaS/PaaS where the person on call doesn't know much about the product (not always their fault) and so couldn't contribute much on incident.

There's just too much trust and good faith in this reply. I'm not advocating managing everything yourself, but no, don't just trust that the experts have everything covered either.


If you don't want the complexity of operators, you'll probably be OK with a DB cluster outside of k8s. They're quite easy to set up and automate, and there are straightforward tools to monitor them (e.g. from Percona).

If you want to fully replicate AWS it may be more expensive than just paying AWS. But for most use cases it's simply not necessary.


As with everything it's not black or white, but rather a spectrum. Sure, updating k8s is not that bad, but operating a distributed storage solution is no joke. Or really anything that requires persistence and clustering (like elastic).

You can also trade operational complexity for cash via support contracts and/or enterprise solutions (like just throwing money at Hitachi for storage rather than trying to keep Ceph alive).


If you don't need something crazy you can just grab what a lot of enterprises already had done for years, which is drop a few big storage servers and call it a day, connecting over iSCSI/NFS/whatever


If you are in Kubernetes land you probably want object storage and some kind of PVC provider. Not thaaat different from an old fashioned iSCSI/NFS setup to be honest, but in my experience different enough to cause friction in an enterprise setting. You really don't want a ticket-driven, manual provisioning process for shares.


A PVC provider is nice, sure, but depending on how much you need/want, the simplest cases can be "mount a subdirectory from a common exported volume", and for many applications ticket-based provisioning will be enough.

That said, on my todo list is some tooling to make simple cases with Linux NFS or SMI-capable servers work as PVC providers.
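
For the really simple case you don't even need a provisioner: a hand-written, statically provisioned PV/PVC pair is enough. A minimal sketch, with made-up server, path and names:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: shared-data
    spec:
      capacity:
        storage: 100Gi
      accessModes: ["ReadWriteMany"]
      nfs:
        server: 10.0.0.5
        path: /exports/shared-data
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-data
    spec:
      accessModes: ["ReadWriteMany"]
      storageClassName: ""   # bind to the static PV above, no dynamic provisioning
      volumeName: shared-data
      resources:
        requests:
          storage: 100Gi
    EOF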


Sure, but it requires that your engineers are vertically capable. In my experience, about 1 in 5 developers has the required experience and does not flat out refuse to have vertical responsibility over their software stack.

And that number might be high, in larger more established companies there might be more engineers who want to stick to their comfort bubble. So many developers reject the idea of writing SQL themselves instead of having the ORM do it, let alone know how to configure replication and failover.

I'd maybe hire for the people who could and would, but the people advocating for just having the cloud take care of these things have a point. You might miss out on an excellent application engineer, if you reject them for not having any Linux skills.


Our devs are responsible for their docker image and the app. Then another team manages the platform. You need some level of cooperation of course, but none of the devs cares too much about k8s internals or how the storage works.


Original creator and maintainer of Reclaim the Stack here.

> you also removed monitoring of the platform

No we did not: Monitoring: https://reclaim-the-stack.com/docs/platform-components/monit...

Log aggregation: https://reclaim-the-stack.com/docs/platform-components/log-a...

Observability is on the whole better than what we had at Heroku since we now have direct access to realtime resource consumption of all infrastructure parts. We also have infinite log retention which would have been prohibitively expensive using Heroku logging addons (though we cap retention at 12 months for GDPR reasons).

> Who/What is going to be doing that on this new platform and how much does that cost?

Me and my colleague who created the tool together manage infrastructure / OS upgrades and look into issues etc. So far we've been in production 1.5 years on this platform. On average we spent perhaps 3 days per month doing platform related work (mostly software upgrades). The rest we spend on full stack application development.

The hypothesis for migrating to Kubernetes was that the available database operators would be robust enough to automate all common high availability / backup / disaster recovery issues. This has proven to be true, apart from the Redis operator which has been our only pain point from a software point of view so far. We are currently rolling out a replacement approach using our own Kubernetes templates instead of relying on an operator at all for Redis.

> Now you need to maintain k8s, postgresql, elasticsearch, redis, secret managements, OSs, storage... These are complex systems that require people understanding how they internally work

Thanks to Talos Linux (https://www.talos.dev/), maintaining K8s has been a non issue.

Running databases via operators has been a non issue, apart from Redis.

Secret management via sealed secrets + CLI tooling has been a non issue (https://reclaim-the-stack.com/docs/platform-components/secre...)
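
(For anyone unfamiliar, the generic sealed-secrets flow looks roughly like this, assuming the standard kubeseal CLI and a sealed-secrets controller in the cluster; the secret name and value below are just placeholders:)

    # Encrypt a secret locally against the cluster's public key, commit the
    # resulting SealedSecret to git, and let the in-cluster controller decrypt it.
    kubectl create secret generic myapp-env \
      --from-literal=DATABASE_URL=postgres://example \
      --dry-run=client -o yaml \
      | kubeseal --format yaml > myapp-env-sealed.yaml
    kubectl apply -f myapp-env-sealed.yaml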

OS management with Talos Linux has been a learning curve but not too bad. We built talos-manager to make bootstrapping new nodes into our cluster straightforward (https://reclaim-the-stack.com/docs/talos-manager/introductio...). The only remaining OS-related maintenance is OS upgrades, which require rebooting servers, but that's about it.

For storage we chose to go with simple local storage instead of complicated network based storage (https://reclaim-the-stack.com/docs/platform-components/persi...). Our servers come with datacenter grade NVMe drives. All our databases are replicated across multiple servers so we can gracefully deal with failures, should they occur.

> Who is going to upgrade kubernetes when they release a new version that has breaking changes?

Upgrading Kubernetes in general can be done with 0 downtime and is handled by a single talosctl CLI command. Breaking changes in K8s imply changes to existing resource manifest schemas and are detected by tooling before upgrades occur. Given how stable Kubernetes resource schemas are and how averse the community is to pushing breaking changes, I don't expect this to cause major issues going forward. But of course software upgrades will always require due diligence and can sometimes be time consuming; K8s is no exception.
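
For reference, the Talos-managed upgrade paths are single commands along these lines (node IPs and versions are placeholders, and flags may differ slightly between Talos releases):

    # Upgrade Kubernetes across the cluster, orchestrated via one control plane node.
    talosctl --nodes 10.0.0.10 upgrade-k8s --to 1.30.1

    # Upgrade Talos Linux itself on a node (reboots it; failed upgrades roll back).
    talosctl --nodes 10.0.0.10 upgrade --image ghcr.io/siderolabs/installer:v1.7.5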

> What happens when ElasticSearch decides to splitbrain and your search stops working?

ElasticSearch, since major version 7, should not enter split brain if correctly deployed across 3 or more nodes. That said, in case of a complete disaster we could either rebuild our index from source of truth (Postgres) or do disaster recovery from off site backups.

It's not like using ElasticCloud protects against these things in any meaningfully different way. However, the feedback loop of contacting support would be slower.

> When the DB goes down or you need to set up replication?

Operators handle failovers. If we would lose all replicas in a major disaster event we would have to recover from off site backups. Same rules would apply for managed databases.

> What is monitoring replication lag?

For Postgres, which is our only critical data source, replication lag monitoring + alerting is built into the operator.

It should be straightforward to add this for Redis and ElasticSearch as well.

> Or even simply things like disks being close to full?

Disk space monitoring and alerting is built into our monitoring stack.

At the end of the day I can only describe to you the facts of our experience. We have reduced costs to cover hiring about 4 full time DevOps people so far. But we have hired 0 new engineers and are managing fine with just a few days of additional platform maintenance per month.

That said, we're not trying to make the point that EVERYONE should Reclaim the Stack. We documented our thoughts about it here: https://reclaim-the-stack.com/docs/kubernetes-platform/intro...


Since you're the original creator, can you open the site of your product, and find the link to your project that you open sourced?

- Front page links to docs and Discord.

- First page of docs only has a link to discord.

- Installation references a "get started" repo that is... somehow also the main repo, not just "get started"?


The get-started repo is the starting point for installing the platform. Since the platform is gitops based, you'll fork this repo as described in: https://reclaim-the-stack.com/docs/kubernetes-platform/insta...

If this is confusing, maybe it would make sense to rename the repo to "platform" or something.

The other main component is k (https://github.com/reclaim-the-stack/k), the CLI for interacting with the platform.

We have also open sourced a tool for deploying Talos Linux on Hetzner called talos-manager: https://github.com/reclaim-the-stack/talos-manager (but you can use any Kubernetes, managed or self-hosted, so this is use-case specific)


You talk a lot about the platform on the page, in the overview page, and there are no links to the platform.

There's not even an overview of what the platform is, how everything is tied together, and where to look at it except bombastic claims, disparate descriptions of its constituent components (with barely any links to how they are used in the "platform" itself), and a link to a repo called "get-started"


Assuming an average salary of 140k/year, you are dedicating 2 resources 3 days a month, and this is already costing you ~38k/year in salaries alone. And that's assuming your engineers have somehow mastered both devops and software (very unlikely) and that they won't screw anything up. I'm not even counting the time it took you to migrate away...

This also assumes your infra doesn't grow and requires more maintenance or you have to deal with other issues.

Focusing on building features and generating revenue is much more valuable than wasting precious engineering time maintaining stacks.

This is hardly a "win" in my book.


Right, because your outsourced cloud provider takes absolutely zero time of any application developers. Any issue with AWS and GCP is just one magic support ticket away and their costs already includes top priority support.

Right? Right?!


Heroku isn’t really analogous to AWS and GCP. Heroku actually is zero effort for the developers.


> Heroku actually is zero effort for the developers.

This is just blatantly untrue.

I was an application developer at a place using Heroku for over four years, and I guarantee you we exceeded the aforementioned 2-devs-3-days-per-month in man hours in my time there due to Heroku:

- Matching up local env to Heroku images, and figuring out what it actually meant when we had to move off deprecated versions

- Peering at Heroku charts because of the lack of real machine observability, and eventually using Node to capture OS metrics and push them into our existing ELK stack because there was just no alternative

- Fighting PR apps to get the right set of env vars to test particular features, and maintaining a set of query-string overrides because there was no way to automate it into the PR deploy

I'm probably forgetting more things, but the idea that Heroku is zero effort for developers is laughable to me. I hate docker personally but it's still way less work than Heroku was to maintain, even if you go all the way down the rabbit hole of optimizing away build times etc.


> Assuming average salary of 140k/year

Is that what developers at your company cost?

Just curious. In Sweden the average devops salary is around 60k.

> you are dedicating 2 resources 3 times a month and this is already costing you ~38k/year on salaries

Ok. So we're currently saving more than 400k/year on our migration. That would be worth 38k/year in salaries to us. But note that our actual salary costs are significantly lower.

> that's assuming your engineers have somehow mastered_both_ devops and software (very unlikely)

Both me and my colleague are proficient at operations as well as programming. I personally believe the skillsets are complementary and that web developers need to get into operations / scaling to fully understand their craft. But I've deployed web sites since the 90s. Maybe I'm of a different breed.

We achieved 4 nines of uptime in our first year on this platform, which is more than we ever achieved using Heroku + other managed cloud services. We won't reach 4 nines in our second year due to a network failure on Hetzner, but so far we have not had downtime due to software issues.

> This also assumes your infra doesn't grow and requires more maintenance

In general the more our infra grows the more we save (and we're still in the process of cutting additional costs as we slowly migrate more stuff over). Since our stack is automated we don't see any significant overhead in maintenance time for adding additional servers.

Potentially some crazy new software could come along that would turn out to be hard to deploy. But if it would be cheaper to use a managed option for that crazy software we could still just use a managed service. It's not like we're making it impossible to use external services by self-hosting.

Note that I wouldn't recommend Reclaim the Stack to early stage startups with minor hosting requirements. As mentioned on our site I think it becomes interesting around $5,000/month in spending (but this will of course vary on a number of factors).

> Focusing on building features and generating revenue is much valuable than wasting precious engineering time maintain stacks.

That's a fair take. But the trade-offs will look different for every company.

What was amazing for us was that the developer experience of our platform ended up being significantly better than Heroku's. So we are now shipping faster. Reducing costs by an order of magnitude also allowed us to take on data intensive additions to our product which we would have never considered in the previous deployment paradigm since costs would have been prohibitively high.


> Just curious. In Sweden the average devops salary is around 60k.

Well, there's salary, and then there's total employee cost. Not sure how it works in Sweden, but here in Belgium a good rule of thumb is that an employer pays roughly 2.5 times what an employee nets after taxes etc. So a net wage of, say, €3,300/month (about €40k/year) ends up costing the employer about €100k.

I'm a freelance devops/sre/platform engineer, and all I can tell you is that even for long-term projects, my yearly invoice is considerably higher than that.


This is more FUD. Employer cost is nowhere near 2.5x employee wages.


Hey there, this is a comprehensive and informative reply!

I had two questions just to learn more.

* What has been your experience with using local NVMes with K8s? It feels like K8s has some assumptions around volume persistence, so I'm curious if these impacted you at all in production.

* How does 'Reclaim the Stack' compare to Kamal? Was migrating off of Heroku your primary motivation for building 'Reclaim the Stack'?

Again, asking just to understand. For context, I'm one of the founders at Ubicloud. We're looking to build a managed K8s service next and evaluating trade-offs related to storage, networking, and IAM. We're also looking at Kamal as a way to deploy web apps. This post is super interesting, so wanted to learn more.


K8s works with both local storage and networked storage. But the two are vastly different from an operations point of view.

With networked storage you get fully decoupled compute / storage which allows Kubernetes to reschedule pods arbitrarily across nodes. But the trade off is you have to run additional storage software, end up with more architectural complexity and get performance bottlenecked by your network.

Please check out our storage documentation for more details: https://reclaim-the-stack.com/docs/platform-components/persi...

> How does 'Reclaim the Stack' compare to Kamal?

Kamal doesn't really do much at all compared to RtS. RtS is more or less a feature-complete Heroku alternative. It comes with monitoring / log aggregation / alerting etc., and also automates high-availability deployments of common databases.

Keep in mind 37signals has a dedicated devops team with 10+ engineers. We have 0 full-time devops people. We would not be able to run our product using Kamal.

That said I think Kamal is a fine fit for eg. running a Rails app using SQLite on a single server.

> Was migrating off of Heroku your primary motivation for building 'Reclaim the Stack'?

Yes.

Feel free to join the Discord and start a conversation if you want to bounce ideas for your k8s service :)


Who says they reduced costs by cutting staff? They could instead have scaled their staff better.


>Who/What is going to be doing that on this new platform and how much does that cost?

If you're already a web platform with hired talent (and someone using Heroku for a SaaS probably already is), I'd be surprised if the marginal cost was 10x. That paid support is of course coming at a premium, and isn't too flexible on what level of support you need.

And yeah, it isn't apples to apples. Maybe you are in a low CoL area and can find a decent DevOps for 80-100k. Maybe you're in SF and any extra dev will be 250k. It'll vary immensely on cost.


This is FUD unless you're running a stock exchange or payment processor where every minute of downtime will cost you hundreds of thousands. For most businesses this is fear-mongering to keep the DevOps & cloud industry going and ensure continued careers in this field.


It's not just about downtime, but also about not getting your systems hacked, not losing your data if sh1t hits the fan, regulation compliance, flexibility (e.g. ability to quickly spin-out new test envs) etc.

My preferred solution to this problem is different, though. For most businesses, apps, a monolith (maybe with a few extra services) + 1 relational DB is all you need. In such a simple setup, many of the problems faced either disappear or get much smaller.


> also about not getting your systems hacked...

The only systems I have ever seen get compromised firsthand were in public clouds and because they were in public clouds. Most of my career has been at shops that, for one reason or another, primarily own their own infrastructure, cloud represents a rather small fraction. It's far easier to secure a few servers behind a firewall than figure out the Rube Goldberg Machine that is cloud configuration.

> not losing your data if sh1t hits the fan

You can use off-site backup without using cloud systems, you know? Backblaze, AWS Glacier, etc. are all pretty reasonable solutions. Most of the time when I've seen the need to exercise the backup strategy it's because of some software fuckup, not something like a disk dying. Using a managed database isn't going to save you when the intern TRUNCATEs the prod database on accident (and if something like that happens, it means you fucked up elsewhere).

> regulation compliance

Most shops would be way better suited to paying a payment processor like Stripe, or other equivalent vendors for similarly protected data. Defense is a whole can of worms, "government clouds" are a scam that make you more vulnerable to an unauthorized export than less.

> flexibility (e.g. ability to quickly spin-out new test envs) etc.

You actually lose flexibility by buying into a particular cloud provider, not gain it. Some things become easier, but many things become harder. Also, IME the hard part of creating reasonable test envs is configuring your edge (ingress, logging infra) and data.


Speaking of the exchanges (at least the sanely operated ones), there’s a reason the stack is simplified compared to most of what is being described here.

When some component fails you absolutely do not want to spend time trying to figure out the underlying cause. Almost all the cases you hear in media of exchange outages are due to unnecessary complexity added to what is already a remarkably complex distributed (in most well designed cases) state machine.

You generally want things to be as simple and streamlined as possible so when something does pop (and it will) your mean time to resolution is inside of a minute.


I run a business that is a long, long way from a stock exchange or a payment processor. And while a few minutes of downtime is fine, 30 minutes or a few hours at the wrong time will really make my customers quite sad. I've been woken in the small hours with technical problems maybe a couple of times over the last 8 years of running it and am quite willing to pay more for my hosting to avoid that happening again.

Not for Heroku, they're absolute garbage these days, but definitely for a better run PaaS.

Plenty of situations where running it yourself makes sense of course. If you have the people and the skills available (and the cost tradeoffs make sense), or if downtime really doesn't matter much at all to you, then go ahead and consider things like this (or possibly simpler self-hosting options, it depends). But no, "you gotta run Kubernetes yourself unless you're a stock exchange" is not a sensible position.


I don't know why people don't value their time at all. PaaS are so cheap these days for the majority of projects, that it just is not worth it to spend your own time to manage the whole infrastructure stack.

If you're forced by regulation or if you just want to do it to learn, then yeah. But if your business is not running infra, or if your infra demands aren't crazy, then PaaS and what-have-you-flavored-cloud-container products will cost you ~1-2 work weeks of a single developer annually.


Unless you already know how to run infra quickly and efficiently. Which – spoiler – you can achieve if you want to learn.


It's not FUD, it's pointing out a very real fact that most problems are not engineering problems that you can fix by choosing the one "magical" engineering solution that will work for all (or even most) situations.

You need to understand your business and your requirements. Us engineers love to think that we can solve everything with the right tools or right engineering solutions. That's not true. There is no "perfect framework." No one-size-fits-all solution that will magically solve everything. What "stack" you choose, what programming language, which frameworks, which hosting providers ... these are all as much business decisions as they are engineering decisions.

Good engineering isn't just about finding the simplest or cheapest solution. It is about understanding the business requirements and finding the right solution for the business.


Having managers (business people) make technical decisions based on marketing copy is how you get 10 technical problems that metastasize into 100 business problems, usually with little awareness of how we got there in the first place.


Nice straw-man. I never once suggested that business people should be making technical decisions. What I said was that engineering solutions need to serve the needs of the business. Those are insanely different statements. They are so different that I think that you actively tried to misinterpret my comment so that you could shoot down something I didn't say.


Well, you're using an overbroad definition of "business decisions", so forgive my interpretation. Of course everything that goes on in a business could be conflated as a "business decision". But not everyone at the business is an MBA, so to speak. "Business" has particular semantics in this case, otherwise "engineering/technical" becomes an empty descriptor.


[flagged]


Not sure if this is going to help Heroku's people at all but I feel bad for them now! haha I'm not a Heroku employee. I don't even work in any sort of managed service / platform provider. This is indeed a new account but not a throwaway account! I intended to use it long term.


You really think that, incredibly lukewarm, argument for Heroku is so extreme that it could only have been written by some kind of undercover shill?


Why, yes?


Please don’t do this. It’s against HN’s guidelines.

Please don't post insinuations about astroturfing, shilling, brigading, foreign agents, and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.

https://news.ycombinator.com/newsguidelines.html


Since DHH has been promoting the 'do-it-yourself' approach, many people have fallen for it.

You're asking the right questions that only a few people know they need answers to.

In my opinion, the closest thing to "reclaiming the stack" while still being a PaaS is to use a "deploy to your cloud account" PaaS provider. These services offer the convenience of a PaaS provider, yet allow you to "eject" to using the cloud provider on your own should your use case evolve.

Example services include https://stacktape.com, https://flightcontrol.dev, and https://www.withcoherence.com.

I'm also working on a PaaS comparison site at https://paascout.io.

Disclosure: I am a founder of Stacktape.


I made the mistake of falling for the k8s hype a few years back for running all of my indie hacker businesses.

Big mistake. Overnight, the cluster config files I used were no longer supported by the k8s version DigitalOcean auto upgraded my cluster to and _boom_. Every single business was offline.

Made the switch to some simple bash scripts for bootstrapping/monitoring/scaling and systemd for starting/restarting apps (nodejs). I'll never look back.


Weird how defensive people get about K8S when you say stuff like this. It’s like they’re desperately trying to convince you that you really do need all that complexity.


I believe there's still a lot of potential for building niche / "human-scale" services/businesses, that don't inherently require the scalability of the cloud or complexity of k8s. Scaling vertically is always easier, modern server hardware has insane perf ceiling. The overall reduction in complexity is a breath of fresh air.

My occasional moral dilemma is idle power usage of overprovisioned resources, but we've found some interesting things to throw at idle hardware to ease our conscience about it.


I particularly like this moniker for such human-scale, "digital gardening"-type software: https://hobbit.software/


I think it's two types of defensiveness.

1. Shovel salesman insisting all "real" gold miners use their shovels

2. Those that have already acquired shovels not wanting their purchase to be mocked/have been made in vain.

Neither are grounded in reality. Why people believe their tiny applications require the same tech that Google invented to help manage their (massive) scale is beyond me.


Most do not, but they still want all the toys that developers are building for “the cloud”.


I've used k8s for the last, uhh, 5 years and this never happened to me. In my case, because I self-host my cluster, there are no unexpected upgrades. But I agree that maintaining a k8s cluster takes some work.


In the 2015-2019 period there were quite a few API improvements involving deprecating old APIs, it’s much more stable/boring now. (Eg TPR -> CRD was the big one for many cluster plugins)


So either digital ocean auto updates breaking versions. Or k8s doesn't do versioning correctly. Both very bad.

Which was it?


Technically both, but more so the former.

I had a heck of a time finding accurate docs on the correct apiVersion to use for things like my ingress and service files (they had a nasty habit of doing beta versions and changing config patterns w/ little backwards compatibility). This was a few years back when your options were a lot of Googling, SO, etc, so the info I found was mixed/spotty.

As a solo founder, I found what worked at the time and assumed (foolishly, in retrospect) that it would just continue to work as my needs were modest.


I assume the first one, but it's more complicated. K8s used to have a lot of features (including very important ones) in the "beta" namespace. There are no stability guarantees there, but everyone used them anyway. Over time they graduated to the "stable" namespace, and after a transition period they were removed from the beta namespace. This broke old deployments when admins ignored warnings for two or three major releases.
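
The Ingress resource is the canonical example: manifests written against extensions/v1beta1 stopped being served in Kubernetes 1.22 and had to be rewritten against networking.k8s.io/v1, roughly like this (names below are made up):

    kubectl apply -f - <<'EOF'
    apiVersion: networking.k8s.io/v1   # was extensions/v1beta1 (removed in 1.22)
    kind: Ingress
    metadata:
      name: myapp
    spec:
      rules:
        - host: myapp.example.com
          http:
            paths:
              - path: /
                pathType: Prefix       # required in v1
                backend:
                  service:             # v1beta1 used serviceName / servicePort
                    name: myapp
                    port:
                      number: 80
    EOF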


Just want to mention that two or three major releases sounds very bad, but Kubernetes had the insane release cadence of 4(!) major versions every year.


It's an odd choice to break backwards compatibility by removing them from the beta namespace. Why not keep them available in both indefinitely?


Probably because the devs understandably can't account for every possible way people might be using it when shipping new features. But in my experience this means k8s is a bag of fiddly bits that requires some serious ops investments to be reliable for anything serious.


With one exception that was rather big change to some low-level stuff, the "remove beta tags" was done with about a year or more of runway for people to upgrade.

And ultimately, it wasn't hard to upgrade, even if you deal with auto-upgrading cluster and forgot about it, because "live" deployments got auto-upgraded - you do need to update your deployment script/whatever though.


How does it compare to a simpler but not hand-crafted solution, such as dokku?


No Docker for starters. I played with Dokku a long time ago and remember it being decent at that time, but still too confusing for my skillset.

Now, I just build my app to an encrypted tarball, upload it to a secure bucket, and then create a short-lived signed URL for instances to curl the code from. From there, I just install deps on the machine and start up the app with systemd.
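
A sketch of that kind of flow, with made-up names and paths (the bucket and encryption details obviously vary):

    # Fetch and unpack a release via a short-lived signed URL, install deps,
    # then hand the process over to systemd.
    mkdir -p /opt/myapp
    curl -fsSL "$SIGNED_RELEASE_URL" -o /tmp/app.tar.gz.enc
    openssl enc -d -aes-256-cbc -pbkdf2 -pass env:RELEASE_KEY \
      -in /tmp/app.tar.gz.enc | tar -xz -C /opt/myapp
    (cd /opt/myapp && npm ci --omit=dev)

    cat > /etc/systemd/system/myapp.service <<'EOF'
    [Unit]
    Description=myapp
    After=network-online.target

    [Service]
    WorkingDirectory=/opt/myapp
    ExecStart=/usr/bin/node server.js
    Restart=always
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl enable --now myapp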

IMO, Docker is overkill for 99% of projects, perhaps all. One of those great ideas, poorly executed (and considering the complexity, I understand why).


> simple bash scripts for bootstrapping/monitoring/scaling

Damn, that's the dream right there


The first live k8s cluster upgrade anyone has to do is usually when they think "what the fuck did I get myself in to?"

It's only good for very large scale stuff. And then a lot of the time that is usually well over provisioned and could be done considerably cheaper using almost any other methodology.

The only good part of Kubernetes I have found in the last 4 years of running it in production is that you can deploy any old limping crap to it and it does its best to keep it alive which means you can spend more time writing YAML and upgrading it every 2 minutes.


We're also ignoring Kubernetes and are just using GitHub Actions, Docker Compose and SSH for our CI Deployments [1]. After a one-time setup on the Deployment Server, we can deploy new Apps with just a few GitHub Action Secrets, which then gets redeployed on every commit, including running any DB Migrations. We're currently using this to deploy and run over 50 .NET Apps across 3 Hetzner VMs.

[1] https://servicestack.net/posts/kubernetes_not_required
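
The deploy step boils down to a couple of shell commands like these (registry, host and paths here are hypothetical):

    # In CI: build and push the image tagged with the commit SHA.
    docker build -t ghcr.io/acme/myapp:$GITHUB_SHA .
    docker push ghcr.io/acme/myapp:$GITHUB_SHA

    # Then redeploy over SSH: pull the new image, run migrations once, recreate.
    ssh deploy@app-host-1 "
      cd /srv/myapp &&
      export TAG=$GITHUB_SHA &&
      docker compose pull &&
      docker compose run --rm app ./migrate &&
      docker compose up -d --remove-orphans
    "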


The amount of complexity people are introducing into their infrastructure is insane. At the end of the day, we're still just building the same CRUD web apps we were building 20 years ago. We have 50x the computation power, much faster disk, much more RAM, and much faster internet.

A pair of load-balanced web servers and a managed database, with Cloudflare out front, will get you really, really far.


EKS has a tab in the dashboard that warns about all the deprecated configs in your cluster, making it pretty foolproof to avoid this by checking every couple years.


Yes, and there are many open source tools that you can point at clusters to do the same. We use kubent (Kube No Trouble) for this.
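
It's a one-liner to run against a cluster before an upgrade (assuming kubent is installed and your kubeconfig points at the cluster; flag names may vary slightly by version):

    # Report objects still using APIs deprecated/removed as of the target version;
    # exit non-zero if any are found, so it can gate a CI pipeline.
    kubent --target-version 1.30.0 --exit-error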


yeouch. sorry man. I've been running in AKS for 3-4 years now and never had an auto-upgrade come in I wasn't expecting. I have been on top of alerts and security bulletins though, may have kept me ahead of the curve.


I was once on a nice family holiday and broke my resolve and did a 'quick' check of my email and found a nastygram billing reminder from a provider. On the one hand I was super-lucky I checked my mail when I did, and on the other I didn't get the holiday I needed and was lucky to not spill over and impact my family's happiness around me.


So what is the alternative? Nomad?


So you had auto update enabled on your cluster and didn’t keep your apiversions up to date?

Sounds like user error.


One of my main criteria for evaluating a platform would be how easy it is to make user errors.


To be honest the API versions have been a lot more stable recently, but back in ~2019 when I first used Kube in production, basic APIs were getting deprecated left and right, 4 times a year; in the end, yes, the problems are "on you", but it's so easy to miss and the results are so disastrous for a platform whose selling points are toughness, resilience and self-healing.


I wish _I_ had a business that was successful enough to justify multiple engineers working 7 months on porting our infrastructure from heroku to kubernetes


Knowing the prices and performance of Heroku (as a former customer) the effort probably paid for itself. Heroku is great for getting started but becomes untenably expensive very fast, and it's neither easy nor straightforward to break the vendor lock in when you decide to leave.


I find AWS ECS with fargate to be a nice middle ground. You still have to deal with IAM, networking, etc. but once you get that sorted it’s quite easy to auto-scale a container and make it highly available.

I’ve used kubernetes as well in the past and it certainly can do the job, but ECS is my go-to currently for a new project. Kubernetes may be better for more complex scenarios, but for a new project or startup I think having a need for kubernetes vs. something simpler like ECS would tend to indicate questionable architecture choices.


ECS is far, far smoother, simpler and more stable than anything else out there in cluster orchestration. It just works. Even with EC2 instances it just works. And if you opt for Fargate, that's an even more stable option.

I am saying this after bootstrapping k8s and ECS both.


The only pain point there I think is auto scaling logic. But otherwise it’s painless.


I find auto-scaling with fargate to be pretty straightforward. What's the pain point there for you?


It works, but it's not really part of Fargate itself; instead it's some combination of CloudWatch events and rules modifying the 'desiredCount' property on the service.

It just feels like it could all be done in a slightly more integrated way.
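
For reference, the target-tracking flavour is the least fiddly version of this; a CloudFormation sketch (cluster/service names and thresholds are made up, and depending on your account you may also need a RoleARN or the service-linked role):

  # Scales an ECS service's DesiredCount between 2 and 10 based on average CPU
  ScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ScalableDimension: ecs:service:DesiredCount
      ResourceId: service/my-cluster/my-service    # placeholder names
      MinCapacity: 2
      MaxCapacity: 10
  CpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        TargetValue: 60
        ScaleInCooldown: 120
        ScaleOutCooldown: 60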


I pretty strongly agree. Fargate is a great product, though it isn't quite perfect.


How does it compare with fly.io? Last I checked, startup time is still in minutes instead of less than a second on fly, but I presume it's more reliable and you get that "nobody ever got fired for using AWS" effect


Fly is really cool and it's definitely an extremely quick way to get a container running in the cloud. I haven't used it in production so I can't speak to reliability, but for me the main thing that stops me from seriously considering it is the lack of an RDS equivalent that runs behind the firewall.


Fly.io is an unreliable piece of shit.


GCP Cloud Run is even better: you don't have to configure all that networking stuff, just ship and run in production.


Does Cloud Run give you a private network? While configuring it is annoying, I do want one for anything serious.


+1 on this. ECS Fargate is great.


From their presentation, they went from $7500/m to $500/m


Assume a dev is $100k/year, so roughly $200k fully loaded with taxes, benefits, etc. That's ~$16,666/month, and at 1.5 months of effort that's ~$25k. Saving $7,000/month, it takes about 3.5 months to break even, and after that they'd save around 0.8 of the dev's pay, or 0.4 of their total cost, per year.

Generally I am hoping my devs are working at a good multiplier of their pay in revenue they generate. Not sure I'd use them this way if there were other things to do.

That said, it sounds like it was mainly for GDPR, so.


Where are you finding capable DevOps engineers for 100k total comp? It’s hard to find someone with the skills to rebuild a SaaS production stack who’s willing to work for that little around here!


100k EUR is a high salary for a dev. A unicorn who knows they are one won't agree to that salary, but would for 150k or 200k.


Europe, probably


Correct, mynewsdesk (that created this reclaim the stack thing) is a Swedish company.


I'm picking a random salary that's not so high that lower-comp countries/industries will reject it, and not so low that higher-comp countries/industries will reject it, and then doing the math on that.

You can then take my numbers or math and plug in YOUR comp rates. But the TL;DR I've seen is that many people never even do napkin math like this on ROI.


Now consider that some places are not in Silly Valley, or not even in the USA, and the fully loaded cost of an engineer (who, once done with the on-prem or at least more "owned" stack, can take on other problems) can be way, way lower.


These numbers are actually WAY LOW for Silicon Valley. If I was doing the ROI for an Amazon employee I'd start at around 350 all told for an entry-level SDE, and half a mill for one with a few years' experience.

But they're also way high for other places. And just right for some.

The point is how to do the math, not the ballpark. Also that even at 100k for a dev it's maybe a wash, depending on your time horizon.


My experience is that a lot of "simpler alternatives" ballooned costs beyond the cost of someone to wrangle the more complex solution - and well, after the initial pains the workload drops, so you can have them tackle other problems if not full time.

Or as I said a few times at meetups, Heroku is what I use when I want to go bankrupt (that was before Heroku got sold).


We've since migrated more stuff. We're currently saving more than $400k/year.


I mean, $7,000 a month isn’t nothing. But it’s not a lot. Certainly not enough to justify a seven month engineering effort plus infinite ongoing maintenance.


This is $7k/mo today. If they are actively growing, and their demand for compute is slated to grow 5x or even 10x in a year, they wanted to get off Heroku fast.


The main engineering effort to reduce by that much was completed in 6 weeks according to their YouTube video.

7 months is presumably more like “the time it has been stable for” or so, although I am not sure the dates line up for that 100%.

Also cost reduction was apparently not the main impetus, GDPR compliance was.


That “main engineering effort” will go on forever. People neglect to note that everything is constantly changing. Just like the roof on your house, if you don’t upgrade your components regularly, eventually you will face a huge rewrite when that thing your ancient home-made infrastructure relies on is no longer supported or is no longer updated to support the latest thing you need for your SaaS.

You can’t avoid this cost. Some people refer to it as technical debt, but I think more accurately it could be called “infrastructure debt”. Platform providers maintain the infrastructure debt for you - this is what you pay them for. And they do it with tremendous economies of scale. Unless your scale is truly enormous - like Meta, for instance - it isn’t worth building your own infrastructure.


Would you say one person not working 100% of the time is also quite minor? ;)


Sure. We have around 10 of those. It’s a significant boon to the project for them to do nothing.


Moving from Heroku to Render or Fly.io is very straightforward; it’s just containers.


(Except for Postgres, since Fly's solution isn't managed)

Heroku's price is a persistent annoyance for every startup that uses it.

Rebuilding Heroku's stack is an attractive problem (evidenced by the graveyard of Heroku clones on Github). There's a clear KPI ($), Salesforce's pricing feels wrong out of principle, and engineering is all about efficiency!

Unfortunately, it's also an iceberg problem. And while infrastructure is not "hard" in the comp-sci sense, custom infra always creates work when your time would be better spent elsewhere.


> Salesforce's pricing feels wrong out of principle

What do you mean exactly? If it takes multiple engineers multiple months to build an alternative on kubernetes, then it sounds like Heroku is worth it to a lot of companies. These costs are very "known" when you start using Heroku too, it's not like Salesforce hides everything from you then jump scares you 18 months down the line.

SF's CRM is also known to be expensive, and yet it's extremely widely used. Something being expensive definitely doesn't always mean it's bad and you should cheap out to avoid it.


Couldn't you move to AWS? They offer managed PostgreSQL. Heroku already runs on AWS, so there could be a potential saving in running an AWS managed service.

It's still a lot of work obviously.


So does GCP and Azure. At least in GCP land the stuff is really quite reasonably priced, too.


I moved our entire stack from Heroku to Render in a day and pay 1/3 less. Render is what Heroku would be if they never stopped innovating. Now I’m thinking of moving to fly as they are even cheaper.


If you use containers. If you're big enough for the cost savings to matter, you're probably also not looking for a service like Render or Fly. If your workload is really "just containers" you can save more with even managed container services from AWS or GCP.


We are talking about moving from Heroku, I don't think being too needy for the likes of Fly is at all a given. (And people will way prematurely think they're too big or needy for x.)


Technically, you don’t even need to set up containers for Render.


So is kubernetes. GKE isn't that bad.


Unless you relied on heroku build packs.


Buildpacks are open source too [1]

[1] https://buildpacks.io/


I mean this is what they recommend:

- Your current cloud / PaaS costs are north of $5,000/month
- You have at least two developers who are into the idea of running Kubernetes and their own infrastructure and are willing to spend some time learning how to do so

So you will spend 150k+/year (2 senior full stack eng salaries in the EU - can be much higher, esp. for people up to the task) to save 60k+/y in infra costs?

Does not compute for me - is the lock-in that bad?

I understand it for very small/simple use cases - but then do you need k8s at all?

It feels like the ones who will benefit the most are orgs that spend much more on cloud costs - but they need SLAs, compliance and a dozen other enterprisey things.

So I struggle to understand who would benefit from this stack reclaim.


Creator of Reclaim the Stack here.

The idea that we're implying you need 2 full time engineers is a misunderstanding. We just mean to say that you'll want at least 2 developers to spend enough time digging in to Kubernetes etc to have a good enough idea of what you're doing. I don't think more than 2 months of messing about should be required to reach proficiency.

We currently don't spend more than ~4 days per month total working on platform related stuff (often we spend 0 days, eg. I was on parental leave during 3 months and no one touched the platform during that time).

WRT employee cost, Swedish DevOps engineers cost less than half of what you mentioned on average, but I guess YMMV depending on region.


Fyi, we use asterisks (*) for emphasis on HN


underscores around italics and asterisks around strong/bold were an informal convention on BBS, IRC and forums way before atx/markdown.


I'm talking about the HN markup, italics don't work here, only asterisks do


I meant underscores, just noticed


Different thing. Using visible _ is a conscious choice.


Why?


It looks nice and has been a staple in hacker culture for decades, long before we had rich text and were just chatting on IRC.


It doesn't look nice at all to me. Real emphasis looks way nicer, that's its purpose. Now that we have rich text, why not utilize it?


I use both and will continue to do so. You're trying to lecture people who have been on HN for more than 10 years how "we" do stuff around here.


There's so much wrong with this reply I won't even bother trying to respond, I feel the negativity of this comment pour through my screen


Also it looks like _underlined_ text


Who's "we?"


Everyone who uses italics on HN, which is a lot of us: https://news.ycombinator.com/formatdoc


Not to be a pedantic asshole, but those guidelines don't mention italicizing as emphasis, just that * causes italics. In fact the OP should probably say that they believe "HN users use italicization to emphasize," which again, who's "we?" _This_ style of emphasis, as others have mentioned, has been bouncing around IRC and whatnot forever.


Most of HN users


In my experience you can get pretty far with just a handful of vms and some bash scripts. At least double digit million ARR. Less is more when it comes to devops tooling imo.


> you can get pretty far with just a handful of vms and some bash scripts. At least double digit million ARR.

Using ARR as the measurement for how far you can scale devops practices is weird to me. Double-digit million ARR might be a few hundred accounts if you're doing B2B, and double-digit million MAUs if you're doing an ad-funded social platform. Depending on how much software is involved your product could be built by a team of anywhere from 1-50 developers.

If you're a one-developer B2B company handling 1-3 requests per second you wouldn't even need more than one VM except maybe as redundancy. But if you're the fifty-developer company that's building something beyond simple CRUD, there are a lot of perks that come with a full-fledged control plane that would almost certainly be worth the added cost and complexity.


> there are a lot of perks that come with a full-fledged control plane that would almost certainly be worth the added cost and complexity.

Such as?

Logging is more complicated with multi container microservice deployments. Deploying is more complicated. Debugging and error tracing is more difficult. What are the perks?


You get more tools to mitigate those problems. Those tools add more complexity to the system, but that's of course solvable by higher level tools.


I used to work at a Fintech company where we had around 1-20k concurrent active users, monthly around 2 million active users. I forget the RPS, but it was maybe around 200-1000 normally? We ran on bare metal, bash scripts, not a container in sight. It was spaghetti, granted, but it worked surprisingly well.


> double-digit million MAUs

I was about to make a similar point, but you did the math, and it's holding up for the GP's side.

You can push vms and direct to ssh synchronization up to double-digit million MAU (unless you are using stuff like persistent web-sockets). It won't be pretty, but you can get that far.


I'm not concerned about handling the requests for the main user-facing application (as you say, you can get way further with a single box than many people think), I'm thinking about all of the additional complexity that comes with serving multiple millions of human users that wouldn't exist if you were just serving a few hundred web scrapers that happen to produce as much traffic as multiple millions of humans.

What those sources of complexity are depends a lot on the product, but some examples include admin tooling for your CS department, automated content moderation systems, more thorough logging and monitoring, DDOS mitigation, feature flagging and A/B testing, compliance, etc. Not to mention the overhead of coordinating the work of 50 developers vs 1—deploying over SSH is well and good when you can reasonably expect a small handful of people to need to do it, but automatic deploys from main deployed from a secure build machine is a massive boon to the larger team.

Any one of these things has an obvious answer—just add ${software} to your one VM or create one extra bare-metal build server or put your app behind Cloudflare—but when you have a few dozen of these sources of complexity then AWS's control plane offerings start to look very attractive. And once you have 50 developers on the payroll spending a few hundred a month on cloud to avoid hand-rolling solutions isn't exactly a hard sell.


Of course you can get away with that if your metric is revenue. (I think Blippi makes about that much with, I suspect, nary a VM in sight!)

The question is what you're doing with your infrastructure, not how much revenue you're making. Some things have higher return to "devops" and others have less.


I agree, this is an incredibly valid approach for some companies and startups. If you benefit by being frugal and are doing something that doesn't need incredible availability, a rack of servers in a colo doesn't cost much and you can take it pretty far without a huge amount of effort.


+1 or just use App Engine, deploy your app and scale


App engine deploys are soooo slow. I liked cloud run a lot more.


It's good to see new projects. However most people shouldn't start with Kubernetes at all. If you don't need autoscaling, give Kamal[0] a go. It's the tool 37signals made to leave Kubernetes and cloud. Works super well with simple VMs. I also wrote a handbook[1] to get people started.

[0] https://kamal-deploy.org [1] https://kamalmanual.com/handbook/


(Reclaim the Stack creator here)

We don't do autoscaling.

The main reason for Kubernetes for us was automation of monitoring / logs / alerting and highly available database deployments.

37signals has a dedicated operations team with more than 10 people. We have 0 dedicated operations people. We would not have been able to run our product with Kamal given our four nines uptime target.

(that said, I do like Kamal, especially v2 seems to smooth out some edges, and I'm all for simple single server deployments)


Bought both your books, they are awesome :)


I’m not going to trust a project like this – made by and for one company – with production workloads.


hahaha, do you even realize what else this company makes?


It looks like a nice Kubernetes setup! But I don’t see how this is comparable to something like Heroku – the complexity is way higher from what I see.

If you’re looking for something simpler, try https://dokku.com/ (the OG self-hosted Heroku) or https://lunni.dev/ (which I’ve been working on for a while, with a docker-compose based workflow instead). (I've also heard good things about coolify.io!)


Since there are so many mixed comments here, I'll share my experience. Our startup started on day one with Kubernetes. It took me about six weeks to write the respective Terraform and manifests and combine them into a homogeneous system. It's been smooth sailing for almost two years now.

I'm starting to suspect the wide range of experiences has to do with engineering decisions. Nowadays, it's almost trivial to over-engineer a Kubernetes setup. In fact, with platform engineering becoming all the rage these days, I can't help but notice how over-engineered most reference architectures are for your average mid-sized company. Of course, that's probably by design (Humanitec sure enjoys the money), but it's all completely optional. I intentionally started with a dead-simple EKS setup: flat VPC with no crazy networking, simple EBS volumes for persistence, an ALB on the edge to cover ingress, and External Secrets to sync from AWS Secrets Manager. No service mesh, no fancy BPF shenanigans, just a cluster so simple that replicating to multiple environments was trivial.

The great part is that because we've had such excellent stability, I've been able to slowly build out a custom platform that abstracts what little complexity there was (mostly around writing manifests). I'm not suggesting Kubernetes is for everyone, but the hate it tends to get on HN still continues to make me scratch my head to this day.
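
As a rough illustration of how thin that abstraction can stay, syncing one value from Secrets Manager with External Secrets is roughly this much YAML (the store name and key path are placeholders, not our actual config):

  apiVersion: external-secrets.io/v1beta1
  kind: ExternalSecret
  metadata:
    name: app-database
  spec:
    refreshInterval: 1h
    secretStoreRef:
      name: aws-secrets-manager      # a (Cluster)SecretStore configured elsewhere
      kind: ClusterSecretStore
    target:
      name: app-database             # the plain Kubernetes Secret that gets created
    data:
      - secretKey: DATABASE_URL
        remoteRef:
          key: prod/app/database-url # path in AWS Secrets Manager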


“Our basic philosophy when it comes to security is that we can trust our developers and that we can trust the private network within the cluster.”

This is not my area of expertise. Does it add a significant amount of complexity to configure this kind of system in a way that doesn’t require trusting the network? Where are the pain points?


> Our basic philosophy when it comes to security is that we can trust our developers and that we can trust the private network within the cluster.

As an infosec guy, I hate to say it but this is IMO very misguided. Insider attacks and external attacks are often indistinguishable because attackers are happy to steal developer credentials or infect their laptops with malware.

Same with trusting the private network. That’s fine and dandy until attackers are in your network, and now they have free rein because you assumed you could keep the bad people outside the walls protecting your soft, squishy insides.


One of the best things you can do is restrict your VPCs from accessing the internet willy-nilly outbound. When an attacker breaches you, this can keep them from downloading payloads and exfiltrating data.
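
Inside a Kubernetes cluster the same idea can be sketched as a default-deny egress NetworkPolicy with a small allowlist (the namespace, CIDR and ports below are made up):

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: restrict-egress
    namespace: prod
  spec:
    podSelector: {}            # applies to every pod in the namespace
    policyTypes: [Egress]
    egress:
      - to:                    # allow DNS lookups inside the cluster
          - namespaceSelector: {}
        ports:
          - { protocol: UDP, port: 53 }
          - { protocol: TCP, port: 53 }
      - to:                    # allow one approved external endpoint, deny everything else
          - ipBlock:
              cidr: 203.0.113.10/32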


You’ve just broken a hundred things that developers and ops staff need daily to block a theoretical vulnerability that is irrelevant unless you’re already severely breached.

This kind of thinking is why secops often develops an adversarial relationship with other teams — the teams actually making money.

I’ve seen this dynamic play out dozens of times and I’ve never seen it block an attack. I have seen it tank productivity and break production systems many times however.

PS: The biggest impact denying outbound traffic has is to block Windows Update or the equivalent for other operating systems or applications. I’m working with a team right now that has to smuggle NPM modules in from their home PCs because they can’t run “npm audit fix” successfully on their isolated cloud PCs. Yes, for security they’re prevented from updating vulnerable packages unless they bend over backwards.


> You’ve just broken a hundred things that developers and ops staff need daily to block a theoretical vulnerability that is irrelevant unless you’re already severely breached.

I’m both a developer and a DFIR expert, and I practice what I preach. The apps I ship have a small allowlist for necessary external endpoints and everything else is denied.

Trust me, your vulnerabilities aren’t theoretical, especially if you’re using Windows systems for internet-facing prod.


This should still be fresh in the mind of anyone who was using log4j in 2021.


> I’ve seen this dynamic play out dozens of times and I’ve never seen it block an attack.

I am a DFIR consultant, and I've been involved in 20 or 30 engagements over the last 15 years where proper egress controls would've stopped the adversary in their tracks.


Any statement like that qualified with “proper” is a no true Scotsman fallacy.

What do you consider proper egress blocking? No DNS? No ICMP? No access to any web proxy? No CDP or OCSP access? Strict domain-based filtering of all outbound traffic? What about cloud management endpoints?

This can get to the point that it becomes nigh impossible to troubleshoot anything. Not even “ping” works!

And troubleshoot you will have to, trust me. You’ll discover that root cert updates are out-of-band and not included in some other security patches. And you’ll discover that the 60s delay that’s impossible to pin down is a CRL validating timeout. You’ll discover that ICMP isn’t as optional as you thought.

I’ve been that engineer, I’ve done this work, and I consider it a waste of time unless it is protecting at least a billion dollars worth of secrets.

PS: practically 100% of exfiltrated data goes via established and approved channels such as OneDrive. I just had a customer send a cloud VM disk backup via SharePoint to a third party operating in another country. Oh, not to mention the telco that has outsourced core IT functions to both Chinese and Russian companies. No worries though! They’ve blocked me from using ping to fix their broken network.


there's no need for this to be an either/or decision.

private artifact repos with the ability to act as a caching proxy are easy to set up. afaik all the major cloud providers offer basic ones with the ability to use block or allow lists.

going up a level in terms of capabilities, JFrog is miserable to deal with as a vendor but Artifactory is hard to beat when it comes to artifact management.


Sure… for like one IDE or one language. Now try that for half a dozen languages, tools, environments, and repos. Make sure to make it all work for build pipelines, and not just the default ones either! You need a bunch of on-prem agents to work around the firewall constraints.

This alone can keep multiple FTEs busy permanently.

“Easy” is relative.

Maybe you work in a place with a thousand devs and infinite VC money protecting a trillion dollars of intellectual property then sure, it’s easy.

If you work in a normal enterprise it’s not easy at all.


Their caching proxy sucks though. We had to turn it off because it persistently caused build issues due to its unreliability.


I can't be certain, but I think the GP means production VMs, not people's workstations. Or maybe I fail to understand the complexities you have seen, but I'm basing my judgment especially on the "download from home" thing, which seems only necessary if you lacked full Internet access on your workstation.


The entire network has a default deny rule outbound. Web traffic needs to go via authenticating proxies.

Most Linux-pedigree tools don’t support authenticating proxies at all, or do so very poorly. For example, most have just a single proxy setting that’s either “on” or “off”. Compare that to PAC files typically used in corporate environments that implement a fine grained policy selecting different proxies based on location or destination.

It’s very easy to get into a scenario where one tool requires a proxy env var that breaks another tool.

“Stop complaining about the hoops! Just jump through them already! We need you to do that forever and ever because we might get attacked one day by an attacker that’ll work around the outbound block in about five minutes!”


In the scenario presented, can't they just exfiltrate using the developer credentials / machine?


Let’s say there’s a log4j-type vuln and your app is affected. So an attacker can trigger an RCE in your app, which is running in, say, an EC2 instance in a VPC. A well-configured app server instance will have only necessary packages on it, and hopefully not much for dev tools. The instance will also run with certain privileges through IAM and then there won’t be creds on the instance for the attacker to steal.

Typically an RCE like this runs a small script that will download and run a more useful piece of malware, like a webshell. If the webshell doesn’t download, the attacker probably is moving onto the next victim.


But the original comment wasn't about this attack vector...

> attackers are happy to steal developer credentials or infect their laptops with malware

I don't think any of what you said applies when an attacker has control of a developer machine that is allowed inside the network.


I was responding more to "Same with trusting the private network. That’s fine and dandy until attackers are in your network, and now they have free rein because you assumed you could keep the bad people outside the walls protecting your soft, squishy insides."

Obviously this can apply to insiders in a typical corporate network, but it also applies to trust in a prod VPC environment.


That is also a risk. Random developer machines being able to just connect to whatever they like inside prod is another poor architectural choice.


What's your opinion on EDR in general? I find it very distasteful from a privacy perspective, but obviously it could be beneficial at scale. I just wish there was a better middle ground.


Not the OP but I was on that side -

They do work. My best analogy is it's like working at TSA except there are three terrorist attacks per week.

As far as privacy goes, by the same analogy, I can guarantee the operators don't care what porn you watch. Doing the job is more important. But still, treat your work machine as a work machine. It's not yours, it's a tool your company lent to you to work with.

That said, on HN your workers are likely to be developers - that does take some more skill, and I'd advise asking a potential provider frank questions about their experience with the sector, as well as your risk tolerance. Devs do dodgy stuff all the time, and they usually know what they're doing, but when they don't you're going to have real fun proving you've remediated.


EDR is not related to the topic but now I'm curious as well. Any good EDR for ubuntu server?


It's a mindset that keeps people like you and I employed in well-paying jobs.


The top pain point is that it requires setting up SSL certificate infrastructure and having to store and distribute those certs around in a secure way.

The secondary effects are entirely dependent on how your microservices talk to their dependencies. Are they already talking to some local proxy that handles load balancing and service discovery? If so, then you can bolt on ssl termination at that layer. If not, and your microservice is using dns and making http requests directly to other services, it’s a game of whack-a-mole modifying all of your software to talk to a local “sidecar”; or you have to configure every service to start doing the SSL validation which can explode in complexity when you end up dealing with a bunch of different languages and libraries.

None of it is impossible by any means, and many companies/stacks do all of this successfully, but it’s all work that doesn’t add features, can lead to performance degradation, and is a hard sell to get funding/time for because your boss’s boss almost certainly trusts the cloud provider to handle such things at their network layer unless they have very specific security requirements and knowledge.


Yes, it adds an additional level of complexity to do role-based access control within k8s.

In my experience, that access control is necessary for several reasons (mistakes due to inexperience, cowboys, compliance requirements, client security questions, etc.) around 50-100 developers.

This isn't just "not zero trust", it's access to everything inside the cluster (and maybe the cluster components themselves) or access to nothing -- there is no way to grant partial access to what's running in the cluster.


This is just bad security practice. You cannot trust the internal network, so many companies have been abused following this principle. You have to allow for the possibility that your neighbors are hostile.


Implementing "Zero Trust" architectures are definitely more onerous to deal with for everyone involved (both devs and customers, if on prem). Just Google "zero trust architecture" to find examples. A lot more work (and therefore $) to setup and maintain, but also better security since now breaching network perimeter is no longer enough to pwn everything inside said network.


It requires encrypting all network traffic, either with something like TLS, or IPSec VPN.


"SSL added and removed here :^)"


> We spent 7 months building a Kubernetes based platform to replace Heroku for our SaaS product at mynewsdesk.com. The results were a 90% reduction in costs and a 30% improvement in performance.

I don't mean to sound dismissive, but maybe the problem is just that Heroku is/was slow and expensive? Meaning this isn't necessarily the right or quote-unquote "best" approach to reclaiming the stack


How does this compare to dokku (https://dokku.com/)?


Main difference is that Dokku is a simple single server platform, geared mostly toward hobby projects.

Reclaim the Stack provides a fully highly available multi node platform to host large scale SaaS applications aiming for four nines of uptime.


This sounds great, I’ll be building our prod infra stack and deploying to cloud for the first time here in the next few weeks, so this is timely.

It’s nice seeing some OSS-based tooling around k8s. I know it’s a favorite refrain that “k8s is unnecessary/too complex, you don’t need it” for many folks getting started with their deployments, but I already know and use it in my day job, so it feels like a pretty natural choice.


I really hated Kubernetes at first because the tooling is so complicated. However, having worked with raw Docker API and looking into the k8s counterparts, I’m starting to appreciate it a lot more.

(But it still needs more accessible tooling! Kompose is a good start though: https://kompose.io/)


Feel free to join the RtS discord if you want to bounce ideas for your upcoming infra


The K8s is unnecessary meme is perpetuated by people that don’t understand it.


True, but also, sometimes it’s not needed.


Sometimes it just feels good wearing a fig leaf around my groin, wielding a mid-sized log as a crude club, & running through the jungle.

"You might not need it" is the kernel of doubt that can undermine any reasonable option. And it suggests nothing. Sure, you can go write your own kernel! You can make your own database! You might not need to use good, well-known, proven technology that people understand and can learn about online! You can do it yourself! Or cobble together some alternate, lesser, special stack that only you have distilled.

We don't need civilization. We can go it alone & do our own thing, leave behind shared frames of reference. But damn, it just seems so absurdly inadvisable, and the fear, uncertainty & doubt telling us Kubernetes is hard and bad and too much feels so overblown. This article does certainly lend credence to the idea that Kubernetes is complex, but there are so many simpler starting places that will take many teams very far.


Somehow Kubernetes and civilization just aren't in the same category of salience to me. I think it's reasonable to say that Kubernetes is optional in a way that civilization isn't.

Like maybe one of those things is more important than the other.


I don't disagree, and there's plenty of room for other competitors to arise. We see some Kamal mentions. Microsoft keeps trying to make Dapr a thing, godspeed.

But very few other options exist with the same scope, scale & extensibility that allow them to become broadly adopted platform infrastructure. The folks saying you might not need Kubernetes, in my view, do a massive disservice by driving people to construct their own unique paths piece by piece, rather than being part of something broader. In my view there are just too many reasons why you want your platform to be something socially prevalent, well travelled by others too, and right now there are few other large, popular, extensible platforms that suit this beyond Kubernetes.


If they don't understand it but still get their jobs done...

Tractors are also unnecessary. Plenty of people grow tomatoes off their balcony without tractors.

If somebody insists on growing 40 acres of tomatoes without a tractor because tractors aren't necessary, why argue with them? If they try to force you to not use a tractor, that's different.


k8s is relatively straightforward, it's the ecosystem around it that is total bullcrap, because you won't only run k8s, you will also run Helm, a templating language or an ad-hoc mess of scripts, a CNI, a CI/CD system, operators, sidecars, etc. and every one of these is an over-engineered buggy mess with half a dozen hyped alternatives that are in alpha state with their own set of bugs.

How Kubernetes works is pretty simple, but administering it is living a life of constant analysis paralysis and churn and hype cycles. It is a world built by companies that have something to sell you.


Just had an incident call last week with 20+ engineers on zoom debugging a prod k8s cluster for 5 hours.


> The results were a 90% reduction in costs and a 30% improvement in performance.

I am in a company with a dedicated infra team and my CEO is an infra enthusiast. He used Terraform and k8s to build the company's infra. But the results are:

- Every deployment takes days; in my experience, I needed to work a 24-hour streak to make it work.
- The infra is complicated to a level that is quite hard to adjust.

And benefits-wise, I can't even think of any. We don't have many users, so the claimed scalability isn't even there.

I would strongly argue a startup should not touch k8s until you have a fair user base and retention.

It's a nightmare to work with.


sounds like your CEO just isn't very good at setting up infra.


Maybe, that is one of the possibilities in my mind too.


DAYS??? Our infra deploys usually take 10 min, with up to 45 min if we're doing some Postgres maintenance stuff. People in a work context should stick to what they are good at.


...but why? How many services does the deployment require?


i got excited until i saw this was kubernetes. you most certainly do not need to add that layer of complexity.

If I can serve 3 million users / month on a $40/month VPS with just Coolify, Postgres, Nginx, Django Gunicorn without Redis, RabbitMQ why should I use Kubernetes?


Coolify does look nice.

But I don't believe it supports HA deployments of Postgres with automated failover / 0 downtime upgrades etc?

Do they even have built in backup support? (a doc exists but appears empty: https://coolify.io/docs/knowledge-base/database-backups)

What makes you feel that Coolify is significantly less complex than Kubernetes?
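
For a sense of scale: the kind of HA Postgres I mean is typically operator-managed on Kubernetes. With an operator like CloudNativePG (just an illustration of the genre, not necessarily what either tool ships) a three-instance cluster with automated failover is roughly this much YAML:

  apiVersion: postgresql.cnpg.io/v1
  kind: Cluster
  metadata:
    name: app-db               # placeholder name
  spec:
    instances: 3               # one primary + two replicas, failover handled by the operator
    storage:
      size: 20Gi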


> why should I use Kubernetes

You shouldn't, but people have started to view Kubernetes as a deployment tool. Kubernetes makes sense when you start having bare metal workers, or a high number of services (micro-services). You need to have a pretty dynamic workload for Kubernetes to result in any cost saving on the operations side. There might be a cost saving if it's easier to deploy your services, but I don't see that being greater than the cost of maintaining and debugging a broken Kubernetes cluster in most cases.

The majority of use cases do not require Kubernetes. The majority of users who think they NEED Kubernetes are wrong. That's not to say that you shouldn't use it if you believe you get some benefit; it's just not your cheapest option.


Got our bill from USD 10k down to USD 0.5k a month by moving away from GCP to Kamal on OVH.

And 30% less latency


that's 95% in savings!!!! bet you can squeeze more with Hetzner

to ppl who disagree,

what business justifies 18x'ing your operating costs?

9.5k USD can get you 3 senior engineers in Canada. 9 in India.


Senior Engineers cost ~$3k a month in Canada?? Seems far-fetched..


We must have very different definitions of senior engineer from the GP, because I’d put the monthly cost of a senior engineer closer to $30k than $3k, even on a log scale.

Employing people requires insurance, buildings, hardware, support, licenses, etc. There are lower cost locations, but I can’t think of a single market on earth where there is a supply of senior engineers that cost $3k/month. And I say this being familiar with costs in India, China, Poland, Spain, Mexico, Costa Rica, and at least a dozen other regions.


> even on a log scale.

Log scale is just going to distort the picture in favor of your argument, not against it (10 is closer to 3 than to 30 on a linear scale, but on a log scale it becomes equidistant), so I don't really understand why you're adding that here.

Also, having hired senior software engineers in Europe (France, Germany, Netherlands): if it costs you 30k a month in India or Poland, you're just being conned. “Hardware, support and licenses” are a bogus argument, as they're completely negligible unless you're doing exotic stuff requiring expensive licenses for your engineers. “Insurance” costs pretty much nothing in most of Europe because health insurance is mostly covered by the state, and “buildings” are mostly a self-inflicted wound nowadays, especially since you'd get the better candidates if you supported fully remote work.

The original 3k is indeed way too low, even for a junior developer, but 30k is equally ridiculous really, as you should never spend more than half of that outside of the US.


Why do you need Coolify?


I'm glad we are starting to lean into cloud-agnostic setups, or building back on-prem/dedicated systems again.


"Join the Discord server"? Who's the audience of this project?


Genuinely curious, what's wrong with that? Did you expect a different platform like Slack?


Locking knowledge behind something that isn't publicly searchable or archivable works fine in the short term, but what happens when Discord/Slack/whatever gears up for an IPO and limits all chat history to 1 week unless you pay up (oh, and now you have a bunch of valuable knowledge stored up there with no migration tool, so your only options are "pay up" or lose the knowledge).


No-one complained when projects had IRC channels/servers, which are even worse since they have no history at all.


At least people treat IRC as ephemeral and place all documentation elsewhere. People are writing whole wikis inside of Discord that are not publicly searchable.


Good projects still do rely on IRC -- Libera.chat is full of proper channels -- and logging bots are ubiquitous.


And you never hear anyone complaining about those. "Locking knowledge" was never an argument before and it's not now.


All IRC clients have local plain text logging and putting a .txt on a web server is trivial.


Local logging doesn't help much for searchability when you're new and it requires you to be online 24/7. Anyway, that's beside the point. Even if IRC had built-in server history it still has the same problems but I never saw people being outraged about it.


What’s recommended here? Self-hosted Discourse?


Github issues or discussions. Or some other kind of forum like Discourse as you mentioned


Matrix and a wiki would solve the community and knowledge base issues.


Matrix has severe UX issues which drastically limit the community willing to use it on a regular basis.


historically, yes. matrix 2.0 (due in two weeks) aims to fix this.


There's a whole FOSS ecosystem of chat/collaboration applications, like Mattermost and Zulip; there's Matrix for a federated solution, and tried-and-true options like IRC.

For something called "Reclaim the Stack" to lock discussion into someone else's proprietary walled garden is quite ironic.


it would be better at the bottom of the first documentation page, after the reader has a better idea of what this is


Also noticed this. Every time I see a project using Discord as its main communication tool it makes me question the "fitness" of the project in the long run.

Discord is NOT a benefit. It's not publicly searchable and the chat format is just not suitable for a knowledge base or support format.

Forums are much better in that regard.


> Discord is NOT a benefit. It's not publicly searchable and the chat format is just not suitable for a knowledge base or support format.

I don't think people who choose Discord necessarily care about that. Discord is where the people are, so that's where they go. It also costs close to nothing to set up a server and, since it has a lower barrier of entry than hosting your own forum, it's deemed good enough.

That said, modern forum software like Discourse https://www.discourse.org/ or Flarum https://flarum.org/ can be pretty good, though I still miss phpBB.


> Discord is where the people are, so that's where they go.

That doesn't sound right. Each Discord community is its own separate space -- you still need people to join your specific community regardless of whether it is hosted on Discord or something better.

> though I still miss phpBB.

It hasn't gone away -- the last release was on August 29th, so this is still very much a viable option.


It's all in one app and the app has a ton of users. Anyone running the app can join any server with a click of a button. There are no separate accounts required to join different communities.

So communities being separate "spaces" doesn't create any meaningful friction with regards to adoption.


People that don't like wasting money?


Not capturing the information and being able to use it in the future is a huge opportunity cost, and idling on discord pays no bills.


Wasting money on... better solutions that are also free?


There is a difference between wasting and spending money.


Seems like a cool premise. Though I guess people building things always want to convince you they are worth it (sort of a conflict of interest); I would like to read an unbiased account of a 7-day migration to this.


Heroku and Reclaim are far from the only two options available. The appropriate choice depends entirely on the team's available expertise and the demands of the applications under development.

There's a lot of disagreement pitting one solution against another. Even if one hosting solution were better than another, the problem is there are SO MANY solutions across so many axes of tradeoffs that it's impossible to determine an appropriate solution (Heroku, Reclaim, etc.) without considering the application and its context of use.

Heroku has all sorts of issues: super expensive, limited functionality. But if it happens to be what a developer team knows and it works for their needs, Heroku could save them lots of money even considering the high cost.

The same is true for reclaim. _If_ you're familiar with all of the tooling, you could host an application with more functionality for less money than heroku.


This looks great! Thank you for sharing, @dustedcodes. I might set up a playground to gain more hands-on experience w/ the relevant significant parts (k8s, argocd, talos) all of which have been on my radar for some time... Also, the docs look great. I love the Architecture Decision Records (bullet-point pros/cons/context)...


Based on the title alone, I thought this was going to be people up in arms about -fomit-frame-pointer being used by distros


Porter (https://www.porter.run/) is a great product in the same vein (e.g. turn K8s into a dev-friendly Heroku-like PaaS). How does this compare?


I think the very concept of this is to open source a common stack, instead of relying on a middleman like Porter, which also costs a TON of money at business tier


> Replicas are used for high availability only, not load balancing

(From https://reclaim-the-stack.com/docs/platform-components/ingre...)

Am I reading this right that they built a k8s-based platform where by default they can't horizontally scale applications?

This seems like a lot of complexity to develop and maintain if they're running applications that don't even need that.


This documentation only pertains to the Cloudflared ingress servers, which can handle orders of magnitude more traffic than we actually get. So we have not had any need to look into load balancing of this part of the infrastructure. Our actual application servers can of course be horizontally scaled.

That said, there is some kind of balancing across multiple cloudflared replicas. But when we measured the traffic Cloudflare sent ~80% of traffic to just one of the available replicas.

We haven't looked into what the actual algorithm is. It may well be that load starts getting better distributed if we were to start hitting the upper limits of a single replica.

Or it may be by design that the load balancing is crappy to provide incentive for Cloudflare customers to buy their dedicated Load Balancing product (https://developers.cloudflare.com/load-balancing/).


A trajectory question: is there an acceptable solution for federating k8s clusters, or is there even such a need? One thing that made EC2 really powerful is that a company can practically create as many clusters (ASGs) of as many nodes as needed, while k8s by default has a scale limit of 5,000 nodes or so. I guess 5,000 nodes will be far from enough for a large company that offers a single compute platform to its employees.


Who is your target audience? There are so many components in this system that it would require a dedicated DevOps team member just to keep it healthy.

What are the advantages over the (free) managed k8s provided by DigitalOcean?

---

Gosh, I'm so happy I was able to jump off the k8s hype train. This is not something SMBs should be using. Now I happily manage my fleet of services without large infra overhead via my own PaaS over Docker Swarm. :)


> Who is your target audience?

Anyone looking for a PaaS alternative matching or exceeding the UX of Heroku.

The "is it for you" section of our Introduction may give a better idea: https://reclaim-the-stack.com/docs/kubernetes-platform/intro...

> What are the advantages over the (free) managed k8s provided by DigitalOcean?

You can run the platform on top of any Kubernetes deployment. So you can run it on top of DigitalOcean kubernetes if you wish. But you'll get more bang for the buck using Hetzner dedicated servers.


I've read the Introduction, but still have no idea why I need to use this platform instead of a managed k8s provided by DO.

It probably makes sense to put a few words on the "components" as well, as it seems to be the main selling point and not the privacy/GDPR concerns.


Oh, thanks for asking. ;)

It is a fair source (future Apache 2.0 License) PaaS. I provide a cloud option if you want to manage less and get extra features (soon - included backup space, uptime monitoring from multiple locations, etc) and, of course, you are free to self-host it for free and without any limitations by using a single installation script. ;)

https://github.com/ptah-sh/ptah-server

But anyway, I'm really curious to know the answers to the questions I have posted above. Thanks!


> Gosh, I'm so happy I was able to jump off the k8s hype train. This is not something SMBs should be using. Now I happily manage my fleet of services without large infra overhead via my own PaaS over Docker Swarm. :)

I mean, I also use Docker Swarm and it's pretty good, especially with Portainer.

To me, the logical order of tools goes with scale a bit like this: Docker Compose --> Docker Swarm --> Hashicorp Nomad / Kubernetes

(with maybe Podman variety of tools where needed)

I've yet to see a company that really needs the latter group of options, but maybe that's because I work in a country that's on the smaller side of things.

All that being said, however, both Nomad and some K8s distributions like K3s https://k3s.io/ can be a fairly okay experience nowadays. It's just that it's also easy to end up with more complexity than you need. I wonder if it's going to be the meme about going full circle and me eventually just using shared hosting with PHP or something again, though so far containers feel like the "right" choice for shipping things reasonably quickly, while being in control of how resources are distributed.


While k3s makes k8s easier for sure, it still comes with lots of complexity on board just because it is k8s. :)

Nowadays I prefer simple tooling over "flexible" tooling for my needs.

Enterprises, however, should stick to k8s-like solutions, as there are just too many variables everywhere: starting from security, and ending with the software architecture itself.


> Having started with Heroku, we have maintained a similar level of security

Remember 2022? https://www.bleepingcomputer.com/news/security/heroku-admits...


Toying with self-hosted k8s at home has taught me that it's the infra equivalent of happy path coding.

Works grand until it blows up in your face for non-obvious reasons.

That's definitely mostly a skill issue on my end, but it would still make me very wary of betting a startup on it.


> We spent 7 months building a Kubernetes based platform to replace Heroku for our SaaS product at mynewsdesk.com.

I thought this was either a joke I was missing, or a rant about Kubernetes. It turned out it was neither, and now I am confused.


I was excited about this title until I read it's just another thing on top of Kubernetes. To me, Kubernetes is part of the problem. Can we reduce the complexity that Kubernetes brings and still have nice things?


> We spent 7 months building a Kubernetes based platform to replace Heroku for our SaaS product

And heroku is based on LXC containers. I'd say it's almost the same thing.


What about “the rest of us” who don’t have time for Kube?


If you know how to write a docker-compose.yml – Docker Swarm to the rescue! I’m making a nice PaaS-style thing on top of it: https://lunni.dev/

You can also use Kubernetes with compose files (e.g. with Kompose [1]; I plan to add support to Lunni, too).

[1]: https://kompose.io/
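
A rough sketch of the Swarm flavour of this, assuming a placeholder image name:

  # docker-compose.yml
  version: "3.8"
  services:
    web:
      image: ghcr.io/example/myapp:latest
      ports:
        - "80:8000"
      deploy:
        replicas: 2
        update_config:
          order: start-first   # start the new task before stopping the old one

  # then, on a manager node:
  #   docker swarm init
  #   docker stack deploy -c docker-compose.yml myapp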


I'm using Docker Compose on every project I have, and it works fine.

Of course, I don't have millions of users, but until then this is enough for me.


"The results were a 90% reduction in costs and a 30% improvement in performance."

What's the scale of this service? How many machines are we talking here?


Went from ~$7,500/m to $520/m iirc, from the presentation.


> 90% reduction in costs

Curious what accounts are being attributed to said costs.

Many new maintenance-related lines will be added, with only one (subscription) removed.


Definitely interesting material. I've realized, especially in the last few years, that there is increased interest in moving away from proprietary clouds/PaaS to K8s or even to bare metal, primarily driven by high prices and also by an interest in having more control.

At Ubicloud, we are attacking the same problem, though from a different angle. We are building an open-source alternative to AWS. You can host it yourself or use our managed services (which are 3x-10x more affordable than comparable services). We already built some primitives such as VMs, PostgreSQL, private networking, load balancers and also working on K8s.

I have a question to HN crowd; which primitives are required to run your workloads? It seems the OP's list consists of Postgres, Redis, Elasticsearch, Secret Manager, Logging/Monitoring, Ingress and Service Mesh. I wonder if this is representative of typical requirements to run HN crowd's workloads.


Quite simple, I want to submit a Docker image, and have it accept HTTP requests at a certain domain, with easy horizontal/vertical scaling. I'm sure your Elastic Compute product is nice but I don't want to set it up myself (let alone run k8s on it). Quite like fly.io.

PS: I like what you guys are doing, I'd subscribe to your mailing list if you had one! :)


Most of the software should work out of the box, but the real problem comes from the hardware.


How can a NewsDesk application need kubernetes?

Wouldn't a single machine and a backup machine do the job?


Because it's a fully featured public relations platform, not just a "newsdesk" (though that's what it started as some 20 years ago).

We have a main monolithic application at the core. But there are plenty of ancillary applications used to run the various parts of our application (eg. analytics, media monitoring, social media monitoring, journalist databases, media delivery, LLM based content suggestion etc).

Then we have at least one staging deployment for each app (the monolith has multiple). All permutations of apps and environments reach about 50 applications deployed on the platform, all with their own highly available databases (Postgres, Redis, ElasticSearch and soon ClickHouse).


Most simple applications that use k8s are doing it for autoscaling or no downtime continuous deployment (or both).
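
For reference, a sketch of what those two features look like expressed directly in Kubernetes (names, image, paths and thresholds are made up):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: web
  spec:
    replicas: 2
    selector:
      matchLabels: { app: web }
    strategy:
      type: RollingUpdate
      rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }   # zero-downtime rollout
    template:
      metadata:
        labels: { app: web }
      spec:
        containers:
          - name: web
            image: ghcr.io/example/web:latest             # placeholder image
            ports: [{ containerPort: 8000 }]
            resources:
              requests: { cpu: 100m }                     # CPU request is needed for CPU-based autoscaling
            readinessProbe:
              httpGet: { path: /healthz, port: 8000 }     # needed for truly zero-downtime rollouts
  ---
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: web
  spec:
    scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: web }
    minReplicas: 2
    maxReplicas: 10
    metrics:
      - type: Resource
        resource:
          name: cpu
          target: { type: Utilization, averageUtilization: 70 }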


So basically 2 things you don’t need k8s to solve?


You don't need anything. You choose the most convenient tool according to your professional judgment. I certainly hope that nobody is using Kubernetes because they are against the wall, and instead decide to use it for its features.


What would you use to solve these problems?


VMs and load balancers?

From the documentation on the site it says that they're running on dedicated servers from Hetzner... So they aren't auto-scaling anything; they are paying for that hardware 24/7. It makes absolutely no difference what the number of running containers is, the cost remains constant.


Is business running a complete application stack on a single machine?


A lot of businesses don’t need more than a couple machines (and can get away with one, but it’s not good for redundancy).


Frequently yes, normally I'd say that the database server is on a separate machine, but otherwise yes.

I've seen companies run a MiniKube installation on a single server and run their applications that way.


my vps reboots every 18 months or so..


I just looked it up - it's because they run Ruby on Rails.


and so what?


Ruby on Rails is well known for not being at the fast end of the spectrum, so it needs lots of machines, and lots of machines gives a reason to use Kubernetes.

A newsdesk application written in something compiled, for example Go, would be much faster and could likely run on a single server.

The benefit of a single server being you don't need Kubernetes and can spend that development resource on developing application features.


In theory, from a performance point of view, we could easily run our main Rails monolith on a single server.

One does not choose single server deployments when reaching for four nines of uptime though. We also run a lot of ancillary applications / staging environments etc., which is what warrants a cohesive deployment platform.

More context here: https://news.ycombinator.com/item?id=41492697


I personally prefer Go to Rails, but let’s be real here: the market cap of Rails is probably, like, a hundred times the market cap of Go.


No doubt with Github, AirBnB, Shopify and other big sites RoR is bigger for the front end.

But now, if lots of those sites are running on K8s with Argo CD or something, or on a cloud platform where the infrastructure is provisioned with Terraform, Go is supporting a great deal of things but is far less visible.


I believe Crystal is worth a look here.


> fully open source stack*. *) Except for Cloudflare

Are there plans to address that too long term?


Not from our point of view, since Cloudflare's DDoS protection and CDN is a crucial part of our architecture.

That said, switching out cloudflared for a more traditional ingress like nginx etc. would be straightforward. No part of the RtS tooling is actually dependent on using Cloudflare for ingress in particular.


having your tool be a single letter, k, seems rather presumptuous.


Especially given K is already the name of an APL derivative.


I suppose it is. But no actual users of the platform have had any complaints about it.


Potential irony, this site isn't loading for me



