We're Leaving Kubernetes (gitpod.io)
517 points by filiptronicek 80 days ago | 333 comments



Personally - just let the developer own the machine they use for development.

If you really need consistency for the environment - Let them own the machine, and then give them a stable base VM image, and pay for decent virtualization tooling that they run... on their own machine.

I have seen several attempts to move dev environments to a remote host. They invariably suck.

Yes - that means you need to pay for decent hardware for your devs, it's usually cheaper than remote resources (for a lot of reasons).

Yes - that means you need to support running your stack locally. This is a good constraint (and a place where containers are your friend for consistency).

Yes - that means you need data generation tooling to populate a local env. This can be automated relatively well, and it's something you need with a remote env anyways.
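A minimal sketch of what such data generation tooling could look like, in Python. The record shape (`id`, `name`, `email`), the function name `generate_users`, and the sample names are all assumptions made for illustration; the only point is that seeding the generator gives every developer the same reproducible local dataset.

```python
import random

FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Dennis"]  # hypothetical sample values


def generate_users(count, seed=0):
    """Return `count` fake user records; the same seed always yields the same data."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    users = []
    for user_id in range(1, count + 1):
        name = rng.choice(FIRST_NAMES)
        users.append({
            "id": user_id,
            "name": name,
            # .test is a reserved TLD, so these addresses can never reach a real inbox
            "email": f"{name.lower()}{user_id}@example.test",
        })
    return users
```

Calling `generate_users(100, seed=42)` on two machines should produce identical rows, which makes bug reports against local data easier to reproduce.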

---

The only real downside is data control (ie - the company has less control over how a developer manages assets like source code). In my experience, the vast majority of companies should worry less about this - your value as a company isn't your source code in 99.5% of cases, it's the team that executes that source code in production.

If you're in the 0.5% of other cases... you know it and you should be in an air-gapped closed room anyways (and I've worked in those too...)


And the reason they suck is the feedback loop is just too slow compared to running it locally. You have to jump through hoops to debug/troubleshoot your code or any issues that you come across between your code and the output of your code. And it's almost impossible to work on things when you have spotty internet. I haven't worked on extremely sensitive data but for PII data from prod to dev, scrubbing is a good practice to follow. This will vary based on the project/team you're on of course.


Aka 'if a developer knew beforehand everything they needed, it wouldn't be development'


That's the least important problem.

The developers also lack knowledge about the environment; can't evolve the environment; can't test the environment for bugs; and invariably interfere with each other because it's never isolated well. And also, yes, it adds lag.

Anyway, yes, working locally on fake data that bears little resemblance to production still beats remote environments.


We tried this approach at a former company with ~600 engineers at the time.

Trying to boot the full service on a single machine required every single developer in the company installing ~50ish microservices on their machine, for things to work correctly. Became totally intractable.

I guess one can grumble about bad architecture all day but this had to be solved. We had to move to remote development environments, which restored everyone’s sanity.

Both FAANG companies I’ve worked at had remote dev environments that were built in house.


> Trying to boot the full service on a single machine required every single developer in the company installing ~50ish microservices on their machine, for things to work correctly. Became totally intractable.

This is certainly one of the critical mistakes you made.

No developer needs to launch half of the company's services to work on a local deployment. That's crazy, and awfully short-sighted.

The only services a developer ever needs to launch locally are the ones that are being changed. Anything else they can consume straight out of a non-prod development environment. That's what non-prod environments are for. You launch your local service locally, you consume whatever you need to consume straight from a cloud environment, you test the contract with a local test set, and you deploy the service. That's it.
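One way to read "test the contract with a local test set" is sketched below in Python. The field names, the `ORDER_CONTRACT` shape, and the `check_contract` helper are hypothetical and not something this thread specifies; the idea is only to assert that a recorded response still carries the fields and types your service relies on.

```python
ORDER_CONTRACT = {"order_id": str, "total_cents": int, "items": list}  # hypothetical contract


def check_contract(payload, contract):
    """Return a list of violations; an empty list means the payload satisfies the contract."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems
```

Extra fields in the payload are deliberately allowed, so a dependency can add data without breaking its consumers.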

> I guess one can grumble about bad architecture all day but this had to be solved.

Yes, it needs to be solved. You need to launch your service locally while consuming dependencies deployed to any cloud environment. That's not a company problem. That's a problem plaguing that particular service, and one which is trivial to solve.

> Both FAANG companies I’ve worked at had remote dev environments that were built in house.

All FANG companies I personally know had indeed remote dev environments. They also had their own custom tool sets to deploy services locally, either in isolation or consuming dependencies deployed to the cloud.

This is not a FANG cargo cult problem. This is a problem you created for yourself out of short-sightedness and for thinking you're too smart for your own good. Newbies know very well they need to launch one service instance alone because that's what they are changing. Veterans know that too well. Why on earth would anyone believe it's reasonable to launch 50 services to do anything at all? Just launch the one service you're working on. That's it. If you believe something prevents you from doing that, that's the problem you need to fix. Simple. Crazy.


> You launch your local service locally, you consume whatever you need to consume straight from a cloud environment, you test the contract with a local test set, and you deploy the service. That's it.

If your services are mostly stateless and/or your development team is very small that can work. If not, you will quickly run into problems sharing the data. Making schema changes to the shared cloud services. Cleaning up dev/test/etc data that has accumulated, etc. Then you are back to thinking of provisioning isolated cloud environment per dev.


What was the point of microservice architecture if you can't develop each service individually in the first place? Sounds to me like the architecture you're talking about isn't an actual microservice, and it's just ball 'o mud over TCP instead of as a single monolith.

At a previous place of work I worked with a monolith structure, and it was actually perfectly fine. Development got done separately on several large substructures in the monolith, and devs could install the whole project locally and run it just fine.

I'm really wondering why we're all using microservice architecture if we're all convinced that to actually develop on them, devs need to reproduce 50odd of those services locally for debugging. Then what was the point?


> Then what was the point?

Resume chasing and trying to paper over the fact that you don't understand architecture, plus bad tooling that e.g. doesn't properly support incremental compilation and so makes monoliths painful.


> Resume chasing (...)

No one does microservices for resume chasing anymore, because everyone is already doing it for practical reasons. I never came across a monolith that wasn't in the process of peeling responsibilities to either microservices or function-as-a-service. For project managers to open their eyes, all that's needed is something like a deployment going wrong due to a single bad commit, or things scaling weird because a background process caused a brownout, or even external teams screwing up a deployment after pushing bad code.


You can always spin up several services locally or if you have a development cluster run the service you are working on locally against development services.


You're going in circles. That is what the commenter is replying to. You often can't just go off dev because other people use it while you're testing, and you're back to just launching everything yourself.


> You often can't just go off dev because other people use it while you're testing.

Why do you think that other people using a cloud environment prevents you from using the environment?


> If your services are mostly stateless and/or your development team is very small that can work. If not, you will quickly run into problems sharing the data.

No, not really. The size of your teams has zero to do with whether your services can corrupt data. That's on you. Don't pin the blame on a service architecture for the design errors you introduced. Everyone else does not have that problem. Why are you having it and blaming the same system architecture used by everyone else?

> Making schema changes to the shared cloud services.

What are you talking about? There are absolutely zero system architectures where you can change schemas willy-nilly without consequences. Why are you trying to pin the blame on microservices for not knowing the basics of how to work with databases?

In all the times I had to work on schema changes, the development was done with a db deployed in a sandbox environment, and when work was done we had to go with a full database migration with blue-green deployments along with gradual rollout. Why on earth are you expecting you can just drop by and change a schema?

> Cleaning up dev/test/etc data that has accumulated, etc.

Isn't this a non-issue to anyone who works with databases? I mean, in extreme scenarios you can spin up a fake database with your local service, but don't even try to argue this is the justification you need to launch dozens of services.

The truth of the matter is that this is extremely simple: if you want to work on a service, launch that service locally configuring it to consume all dependencies running in a non-prod environment. That's what they are for. If you have extremely specialized needs, stub specific dependencies locally. That's it.
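A rough Python sketch of that setup, under stated assumptions: the service names, the `example.test` URLs, and the `<NAME>_URL` environment variable convention are all made up for illustration. Each dependency defaults to the shared non-prod environment, and an override points it at a local stub when needed.

```python
import os

# Hypothetical non-prod endpoints; substitute whatever your shared dev/stage environment exposes.
NON_PROD_DEFAULTS = {
    "auth": "https://auth.dev.example.test",
    "billing": "https://billing.dev.example.test",
}


def resolve_dependency(name, env=None):
    """Return the base URL for a dependency.

    An environment variable such as BILLING_URL (naming convention assumed here)
    overrides the shared non-prod default, e.g. to point at a local stub.
    """
    env = os.environ if env is None else env
    override = env.get(f"{name.upper()}_URL")
    if override:
        return override
    return NON_PROD_DEFAULTS[name]
```

With nothing set, the local service talks to non-prod for everything; exporting `BILLING_URL=http://localhost:9000` would swap in a local stub for just that one dependency.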


Ha. It almost looks like either the fan boys or PR department is making up use-cases.

Yes what you're saying is correct, but why many words when few do trick:

This hypothetical IT department isn't able to host its own development environment, yet suddenly they do have the skills if they switched to gitpod.


I’m sure there are legion ways to do this. For our company, it was Tilt + Telepresence. Locally, we ran our local service with Tilt. The service we ran would often need access to some other service(s), which would have been hard/impossible to bring up in conjunction with their own rabbit hole of dependencies. Thus, those dependency services were made available to the local k8s installation via Telepresence. That way, locally running services could access their dependent services as if they were running in the local environment (even though they were actually running in the cloud). In our case, security teams would only allow us stage environments, but that was enough for most everything.


I may be misunderstanding, but wouldn't you want the particular microservice you are working on independent enough to develop locally, then deploy into the remote environment to test the integration? (I don't work at this scale)


Yes, exactly.

I just also like to have an option to run service locally and connect to either cloud instances (test) or local instances depending on what I am troubleshooting/testing. Much better than debugging on prod which may still be required at some point but hopefully not often.


This is how you get to "I wrote to the spec, it's your problem that clicking the button doesn't do the thing". Huge feedback loops. When you run it "locally" enough you can do your integration stuff before you even ask for review.


> This is how you get to "I wrote to the spec, it's your problem that clicking the button doesn't do the thing".

No, not really. You only find yourself in that spot if you completely failed to do any semblance of integration test, or any acceptance test whatsoever.

That's not a microservices problem. That's a you problem.

You talk about feedback loop. Other than automated tests, what do you believe that is?


That doesn't seem possible to me. If you have a feature that involves 10 teams and 10 services, nothing will actually work until the 10th change is made (assuming everything was done perfectly).


If you have one team per service, yes. In many companies, you may have one team and 10 services though. I wish I was making this up.


Invariably this is an ideal and does not match up in reality. I work at a ~50-ish employee company and we have layers of dependencies between at least 6 or 7 various microservices. I can see this adding up in complexity as the product scales.


> Invariably this is an ideal and does not match up in reality.

No, this does indeed match reality. At least for those who work with microservices. This is microservices 101. It's baffling how this is even being argued.

We have industry behemoths building their whole development experience around this fact. Look at Microsoft. They even went to the extent of supporting Connected Services in Visual Studio 2022. Why on earth do you believe one of the most basic traits of backend development is unreal?

> I work at a ~50-ish employee company and we have layers of dependencies between at least 6 or 7 various microservices.

Irrelevant. Each service has dependencies and consumers. When you need to run an instance of one of those services locally, you point it to its dependencies and you unplug it from its consumers. Done. This is not rocket science.


You can't compare Microsoft to your run-of-the-mill small (or even large) software shop, though. Maybe on HN, most people work on these amazingly designed systems, but in my experience most tech out there is shit and has no proper design or architecture beyond "we're doing microservices because everyone is".


> You can't compare Microsoft to your run-of-the-mill small (or even large) software shop, though.

True. Your run-of-the-mill shop should have a simpler and more straight-forward system.

But you seem to want the reverse.


No one wants the reverse, I would love if my microservices were perfectly isolated little boxes with known inputs and outputs! That would make my life easier. But I don’t have the ownership over the planning process and our sales person already told our customer we’d have the new feature they asked for that no one on the engineering team knew about delivered by next sprint. It would be nice if my company planned things well! But they don’t


So then it’s bad engineering being wagged by Sales, not some expected sane choice.


Just s/Sales/Management, because sales actually can't wag the development process. But yeah.


Most of the time it's bad engineering caused by other engineers.


> You can't compare Microsoft to your run-of-the-mill small (or even large) software shop, though.

I'm talking about how Microsoft added support for connected services to Visual Studio. It's literally a tool that anyone in the world can use. They added the feature to address existing customer needs.


Apart from the fact that not everyone uses Visual Studio, "connected services" appears to be something by which you can connect to existing cloud-based services.

How does that solve the problem of a mess of interconnected services where you may have to change 3 or more of them simultaneously in order to implement a change?


> Apart from the fact that not everyone uses Visual Studio, "connected services" appears to be something by which you can connect to existing cloud-based services.

Yes. That's the point.

> How does that solve the problem of a mess of interconnected services (...)

I don't think you got the point.

The whole point is that you only need to connect your local deployment to services that are up and running. There is absolutely no need to launch a set of ad-hoc self-contained services to run a service locally and work on it. That is the whole point.


Your whole argument boils down to "don't write shit software" which yeah, fair, but in the real world, the company that you just joined has shit code that evolved over 10 years and has accumulated all sorts of legacy cruft. The idea that there is "absolutely no need to launch a set of ad-hoc self-contained services to run a service locally and work on it" just doesn't match the reality of most places I've worked at. You either got very lucky or you didn't work on complex enough systems.


> Your whole argument boils down to "don't write shit software" (...)

No. My whole argument is open your eyes, and look at what you're doing. Make it make sense.

Does it make sense to launch 50 instances locally to be able to do work on a service? No. That's a stupid way of going about a problem.

What would make sense? Launch the services you need to change, of course. Whatever you need to debug, that's what you need to run locally. Everything else you consume from a cloud environment that's up and running.

That's it. Simple.

If there's something preventing you from doing just that then that's an artificial constraint that you created for yourself, and thus that you need to fix. We're talking about things like auth. Once you fix that, go back to square one.


Just FYI, you come across as extremely antagonistic in the way you're conveying your message. The underlying tone seems to be "you're stupid".


Shouldn't all systems be tested end-to-end regardless of if they are microservices or not?


You don’t have to test them all end to end before merging a PR. You should have multiple stable pre prod environments for e2e testing. But if most changes fail e2e testing then your sdlc is broken before then and that should be fixed first. You need better designs and better collaboration and better local tests and code reviews.


> You don’t have to test them all end to end before merging a PR.

You have to test the changes you want to push. That's the whole basis of CI/CD. The question is at which stage are you ok with seeing your pipeline break.

If you accept that you can block your whole pipeline by merging a bad PR then that's ok.

In the meantime, it is customary to configure pipelines to run unit tests, integration tests, and sometimes even contract tests when creating a feature branch. Some platforms even provide high-level support for spinning up sandbox environments as part of their pipeline infrastructure.


So basically a distributed monolith :P


I'm sure companies with well designed and properly isolated services exist but... in my time spent at several companies, "microservices" invariably degenerate to distributed monoliths.


If the microservice has dependencies on other services it is not a microservice.


According to whom? How do those microservices get anything done if they just live in their own isolated world where they can't depend on (call out to) any other microservice?


Would anyone care to explain the reasoning behind their down votes?


Does connecting to a messaging queue or database count as a dependency?

Why not break a microservice into a series of microservices? It's microservices all the way down.


Only if you cannot change one service without changing the other simultaneously. It's fine to have evolving messages on the queue but they have to be backwards compatible with any existing subscribers, because you cannot expect all subscribers to update at the same time. Unless you have a distributed monolith in a monorepo, but at least be honest about it.
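As an illustration of backwards-compatible messages, here is a small Python sketch of a tolerant subscriber. The event shape and field names are hypothetical: `currency` stands in for a field that a producer added later, and `gift_wrap` in the usage note stands in for a field this subscriber has never heard of.

```python
import json


def parse_order_event(raw):
    """Parse an order event while tolerating both older and newer producers.

    Field names are hypothetical. `currency` stands in for a field added later,
    so it gets a default; unknown fields are ignored rather than rejected.
    """
    data = json.loads(raw)
    return {
        "order_id": data["order_id"],             # required since the first version
        "currency": data.get("currency", "USD"),  # newer optional field, defaulted
    }
```

A message like `{"order_id": "A2", "currency": "EUR", "gift_wrap": true}` parses without error even though this subscriber predates `gift_wrap`, so producers and subscribers do not have to be updated at the same time.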

Multiple services connecting to the same database has been considered a bad idea for a long time. I don't necessarily agree, but I have no experience in that department. It does mean more of your business logic lives in the database (rules, triggers, etc).


> Only if you cannot change one service without changing the other simultaneously.

Not true at all.

You're conflating the need for distributed transactions with the definition of microservices. That's not it.

> Multiple services connecting to the same database has been considered a bad idea for a long time.

Not the same thing at all. Microservices do have the database per service pattern, and even the database instance per service instance pattern, but shared database pattern is also something that exists in the real world. That's not what makes a microservice a microservice.


> If the microservice has dependencies on other services it is not a microservice.

You should read up on microservices because that's definitely not what they are, nor anything resembling one of their traits.


Yes, the real world trumps theory, hence my question.


Reminds me of a favorite quote: "What's the difference between theory and practice (reality)? In theory, they're the same."


That's ~12 people per microservice. I understand that it might be hard to boot and coordinate all that on each developer's machine, and that local docker images/vms/whatever might not suit you, but is it possible that this is a problem of your company's own creation?


It's very likely not their decisions that led to this, but their responsibility to improve velocity.

Imagine the goal is to fix the problems (e.g. make it possible to run less of the services or something like that): How do you do that without first running all the services, making the proper changes, and then testing those changes? You need to be able to run all the services in that interim period.

So, wouldn't it be nice if there were a solution for this in-general? And, maybe, it would lead to better conditions later on. But in the meantime there is really no way around the existing design/decisions/etc. You simply have to deal with that reality and engineer around it.


> How do you do that without first running all the services (...)

Why do you need to run all services in isolation to be able to troubleshoot and isolate a problem?


I think it's fair to assume OP would have tried to run only some of the services and has seen or experienced problems with doing that.


> It's very likely not their decisions that led to this

Yeah, I get that, I was deliberate about the phrasing of "your company" rather than just "your".

Obviously we don't know anything about the parent commenter's company and situation, perhaps 12 people per microservice genuinely is the right solution for them, but it seems like it would be better not to get into this situation in the first place, though once there you obviously have to tackle the problem as it presents itself.


Agreed that it may be the right solution, but it smells bad. If I was there I would be trying to reduce the complexity where I could.


> installing ~50ish microservices on their machine

Ouch. Were they using macOS at the time, with laptops having not-enough-ram?

I've seen that go poorly on macOS with java based microservices. Largely due to java VMs wanting ram pre-assigned for each, which really chews through ram that mostly sits around unused.

This was a few years ago though, at the tail end of Intel-based Macs, where 32GB ram in a mac laptop wasn't really an option.


I bet it's less of a RAM issue, and more of an orchestration problem. Making sure you have the latest version of every microservice and its configuration.

"Oh it's not running locally, you need to also run service_18_v2.js, and include the right env variables"


If you have to have the latest version etc of every microservice, it's a distributed monolith.


Right, the crux here is bad architects are building distributed systems that are essentially monoliths. It's the worst of both worlds. You get none of the guarantees and visibility of a monolith, and you get all the friction of a distributed system.


This is the first time I've heard the term distributed monolith and it's just clicked that every microservice application I've worked on has been exactly that.


No, it was definitely a ram issue in this case. The laptops had 16GB of ram, the maximum available at the time. With the java VM overhead that ran things straight into swap and then some.

Running the dev environments remotely (or rewriting in Go) were the options being considered before the whole project was canned and people redistributed to other things.


Given some Kubernetes workloads, even writing everything in Assembly might not help.


I do the great majority of my work on a beefy desktop with a phat GPU, which I mostly run via ssh and a browser.


This + k8s magic that seems to make things coupled tighter.


> This + k8s magic that seems to make things coupled tighter.

I think nobody really talks about this, but unless you have a docker-compose.yml that includes everything you need for local development, it's increasingly more likely that you'll end up coupling things to Kubernetes to such a degree that running without it (and its abstractions) will become more effort than a person can muster.

So while people try to create services that are decoupled from one another, they end up instead coupling them to Kubernetes concepts and a service mesh, service discovery, configuration and secret management mechanisms, persistent storage abstractions and so on.

Which you can obviously do if you want to, but which might make running things locally in a minimalistic fashion that much more complex.

It's the same as with for example using a web server as a reverse proxy for my applications and ending up putting some logic in there (e.g. route rewrites, headers etc.) and then realizing that I must also run a similar web server locally for 1:1 compatibility because something like Vue dev server proxy to the locally running API won't be able to give me all that.


> I think nobody really talks about this, but unless you have a docker-compose.yml that includes everything you need for local development (...)

Is this a problem?

I mean, in this scenario docker compose serves two main purposes: launch a few mock services, and configure those services according to your needs. This means configuring them to consume services already deployed to a cloud environment of your choice. This is something you control.

Then all that's left is the service (or set of services) you need to modify. You run those locally and configure them to consume a mix of the mocked services and the services deployed to a cloud environment.

> (...) it's increasingly more likely that you'll end up coupling things to Kubernetes to such a degree that running without it (and its abstractions) will become more effort than a person can muster.

That coupling can only happen if you intentionally add it yourself.

If you need to run your services in isolation, you will be more mindful of the need to not introduce that sort of coupling.


> Largely due to java VMs wanting ram pre-assigned for each

Do you mean the JVM min heap size was rather large? Otherwise, there is no need for a very large min heap size on a modern JVM (11+).


Pretty sure it was a mix of java 8 (yeah, very outdated even then) and more modern stuff.


Why would you run multiple JVMs?


One per microservice.


This is a bad idea. It defeats a major feature of the jvm.


> This is a bad idea.

Not really. It all depends on what are your needs.

> It defeats a major feature of the jvm.

You're confusing things. Just because Java addressed the deployability problem for Java applications before containerization was even a word, this does not mean that deploying a JVM per service is a bad idea. Just think about it for a second. Why do you need to deploy and scale services independently? Do you mention the JVM in your response? No.


We were talking about dev environments on desktops, not independently scalable components.

I did mention the jvm in my response. You quoted me doing so, in fact.


> We were talking about dev environments on desktops, not independently scalable components.

Your dev environment is expected to mimic your production environment, not the other way around.


You need an option to enable only the services they need to develop locally and automatically configure them to talk to the other services hosted elsewhere.


What ever happened to make? It seems like so much of what developers do now just duplicates what could be done with a single makefile that pokes around, figures out what you need, and makes a script to set it up.


Make is horrible though. It works, but it’s horrible. Hardly surprising people have looked for alternatives, it’s just a shame that most of those are horrible too.


"Use make" advice starts to sound a lot like a fetish and disregards the actual challenges.

Make is just a framework for you to do your builds. Sure you can cram anything into it, but that is exactly the kind of area that other tools like Ansible or even Terraform shine.

Make isn't a silver bullet.

EDIT: Just to make sure, I'm using fetish to mean something you spend an unreasonable amount of time on.


> Both FAANG companies I’ve worked at had remote dev environments that were built in house.

This is certainly not universal among FAANGs though.

Requiring 50 services to be up is absolutely nuts, but it’s actually pretty trivial using something like Nomad locally.


In a past job I've had a good experience in this case with docker compose (well, something similar).

You would list the services you need (or service groups) in a config file, start a command and all services would start in containers. Sure, you need a lot of RAM with that but on 32GB it was working fine.
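A toy Python version of that idea, with everything invented for illustration: the `GROUPS` and `DEPENDS_ON` tables stand in for the config file, and `start_order` is a hypothetical helper. It expands group names, pulls in dependencies, and returns the services in an order where each one comes after the things it depends on. It assumes the dependency graph has no cycles.

```python
# Hypothetical service groups and dependencies, standing in for the config file.
GROUPS = {"checkout": ["cart", "payments"]}
DEPENDS_ON = {"cart": ["db"], "payments": ["db", "auth"], "auth": ["db"], "db": []}


def start_order(requested):
    """Expand group names, pull in dependencies, and return services dependency-first."""
    order = []
    seen = set()

    def visit(service):
        if service in seen:
            return
        seen.add(service)
        for dep in DEPENDS_ON[service]:
            visit(dep)
        order.append(service)  # appended only after all of its dependencies

    for name in requested:
        for service in GROUPS.get(name, [name]):
            visit(service)
    return order
```

Asking for `["checkout"]` yields `["db", "cart", "auth", "payments"]`, so a wrapper command could start containers in that order and skip everything that was not requested.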


> but on 32GB it was working fine.

So 64GB to include the ram for the ide and web browsing / javascript apps like slack?


Most teams/products I have been involved in, the stack always grows to the point that a dev can no longer test it on their own machine, regardless of how big the machine is. And having a different development machine than production leads to completely predictable and unavoidable problems. Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing.


I have used remote dev machines just fine, but my workflow vastly differs from many of my coworkers: terminal-only spacemacs + tmux + mosh. I have a lot of CLI and TUI tools, and I do not use VScode at all. The main GUI app I run is a browser, and that runs locally.

I have worked on developing VMs for other developers that rely on a local IDE and such. The main sticking point is syncing and schlepping source code (something my setup avoids because the source code and editor are on the remote machine). I have tried a number of approaches, and I sympathize with the article author. So, in response to "Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing." <-- syncing and schlepping source code is by no means a solved problem.

I can also say that, my spacemacs config is very vanilla. Like my phone, I don't want to be messing with it when I want to code. Writing tooling for my editor environment is a sideshow for the work I am trying to finish.


Me as well, especially in the days when there was only a UNIX dev server for everyone.

It was never an issue to use X Windows on them, with hummingbird on my Windows thin client.

I guess a new generation has to learn the ways of timesharing development.


I am hardly a dev but occasionally have had to do some scripting or web stuff and have really loved VSCode and using the remote SSH support to basically feel like I’m coding locally. Does that not work for your devs?


UNIX, and other competing timesharing systems of the time, have always been remote first, with Windows catching up with Citrix, followed by RDP, and nowadays finally headless as well.

Nowadays Web frontends and SSH/cloud shell, have replaced what used to be X Windows / telnet / rsh, but the underlying workflows aren't much different than running an IDE / emacs /vi / joe /... from a UNIX development server in a 1990's office.


The funniest (?) thing to me about all this: we're still hoping, if we do things right, to replicate the technology (terminals) from 50 years ago.

I honestly don't understand why nobody has simply invented some software to solve this problem, after 50 years.


Add to it the whole TUI fashion, as if we weren't doing them already 50 years ago.

We did, it is called GUI and language REPLs, like Smalltalk and Interlisp-D development environments, with graphical based terminals, not dependent on replicating virtual teletypes.

Still something that seems problematic to take off the way it should.


That puts a "scary" VSCode blob on the remote-server. Some orgs do not like that, even if it's a "work" class box.


> Most teams/products I have been involved in, the stack always grows to the point that a dev can no longer test it on their own machine, regardless of how big the machine is.

It doesn't have to be like that. I've worked on a 10MLOC codebase with 500+ committers - all perfectly runnable locally, on admittedly slightly beefy dev machines. It's true that systems will grow without limit unless some force exists to counter this, but keeping your stack something you can sanely run on a development machine is well worth spending some actual effort on.


It depends on the application. I've worked at places where the main application being written needs 128GB to run. Another where it's computationally intense and will take hours for a single run on anything but the most massive machines. Another where the dataset is huge and simply being remote makes the bandwidth requirements untenable unless you live across from a data center. I've seen teams just accept their workflow will take 15 minutes each time they run their app because they're copying stuff from remote each time. Remote dev gives you as beefy a machine as you want, with the fastest network and lowest bandwidth.

But that's only the resource problem. Another problem I have seen my entire career, is devs can't keep their machines configured the same. They have different model laptops, they don't pin their app versions, they configure and install things by hand. Each time they change something by accident it takes them hours, days, sometimes weeks, to get it working again. That also can lead to bugs developing the app, which wastes a huge amount of time.

And then there's the fact that their local copy runs completely differently than it does in production. This leads to the app being written with certain assumptions about how it runs, that turn out to be false in production. I've seen this lead to catastrophe, as well as just weeks to months of wasted time, trying to track down issues. This is an undeniable, existential issue.

Finally, it's rare for local setups to be secure. Often devs get too much access from their local machines, and this is stolen by infostealer malware and compromise happens. A protected remote environment is easier to secure. A lot of development is hampered by all the crappy corporate security tools that's on laptops now. Remote dev allows you to bypass all that and have a fully working yet protected network without restrictions.

Is remote dev a pain? Right now, yeah, because nobody has made it be less painful. So of course it's easier on the local machine. But it's not ideal. I'm sure eating with a spoon was more painful than eating with your hands, until forks were popularized in the 18th century. Change took a long time, but I think most of us prefer the change once new tools became widely available.


> Remote dev gives you as beefy a machine as you want, with the fastest network and lowest bandwidth.

Yes and no. Realistically, the range between the beefiest possible remote server and the beefiest possible workstation is what, one order of magnitude? So in a growing environment doing remote dev will maybe let you kick the can down the road a year or two, but you'll still have to deal with whatever was causing your requirements to grow pretty soon.

> But that's only the resource problem. Another problem I have seen my entire career, is devs can't keep their machines configured the same. They have different model laptops, they don't pin their app versions, they configure and install things by hand. Each time they change something by accident it takes them hours, days, sometimes weeks, to get it working again.

> Finally, it's rare for local setups to be secure. Often devs get too much access from their local machines, and this is stolen by infostealer malware and compromise happens. A protected remote environment is easier to secure. A lot of development is hampered by all the crappy corporate security tools that's on laptops now. Remote dev allows you to bypass all that and have a fully working yet protected network without restrictions.

This isn't a remote versus local question, it's a question of how much control developers have over their environment and how much you manage and standardise what's installed. You can have a fully locked down local machine where developers can't install anything except a short whitelist (of course you may get some pushback) and you can have a remote VM where developers curl|sh whatever random repack of Python they wanted this week - I've seen both these things happen in practice.

Security junkware I sort of agree with you, but I think that's more of an artifact of bad laws/policies and if and when remote dev takes off we'll see just as much junkware on remote dev machines as on local ones.


We have a project which spawns around 80 Docker containers and runs pretty OK on a 5 year old Dell laptop with 16GB RAM. The fans run crazy and the laptop is always very hot but I haven't noticed considerable lags, even with IntelliJ running. Most services are written in Go though and are pretty lightweight.


> Most services are written in Go though and are pretty lightweight

That's probably the difference. Throw elasticsearch, kafka and a bunch of Java services in and you'll be easily exhausting your RAM (at least at startup).


16GB of RAM is hardly a gargantuan amount in 2024 - it's not unreasonable to expect someone running a local dev environment to have a more practical amount. I wouldn't buy (or recommend) any machine for dev work with less than 64GB in 2024.


>I wouldn't buy (or recommend) any machine for dev work with less than 64GB in 2024.

Honestly, I think it depends on what kind of languages/frameworks/tools you use. Go is very lightweight, and you can safely run hundreds of Go services under 16 GB no problem (and with a crappy CPU, too). Python/Java/PHP etc. on the other hand, are much more wasteful. Our Go shop only now is considering maybe buying 32GB dev machines...


It has OpenSearch, RabbitMQ, Redis, MySQL


> the stack always grows to the point that a dev can no longer test it on their own machine

So the solution here is to not have that kind of "stack".

I mean, if it's all so big and complex that it can't be run on a laptop then you almost certainly got a lot of problems regardless. What typically happens is tons of interconnected services without clear abstractions or interfaces, and no one really understands this spaghetti mess, and people just keep piling crap on top of it.

This leads to all sorts of problems. Everywhere I've seen this happen they had real problems running stuff in production too, because it was a complex spaghetti mess. The abstracted "easy" dev-env (in whatever form that came) is then also incredibly complex, finicky, and brittle. Never mind running tests, which is typically even worse. It's not uncommon for it all to be broken for every other new person who joins because changes somewhere broke the setup steps which are only run for new people. Everyone else is afraid to do anything with their machine "because it now works".

There are some exceptions where you really need a big beefy machine for a dev env and tests, maybe, but they're few and far between.


That kind of mess sounds super dangerous from a production perspective too.

With things that messy it's fairly likely there would be dependency loops or problems (thundering herd, etc) trying to get things going from a cold start.

ie after a complete outage or similar for whatever reason
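
The dependency-loop worry above is something you can check mechanically. A minimal sketch, assuming you can write down which service needs which (the service names and the `cold_start_order` helper are made up for illustration), using Python's standard `graphlib`:

```python
from graphlib import CycleError, TopologicalSorter


def cold_start_order(depends_on):
    """Return a start order in which every service follows its dependencies.

    `depends_on` maps a service name to the services it needs first.
    Raises ValueError naming the services involved if there is a loop.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # The second element of err.args is the list of nodes in the cycle.
        loop = " -> ".join(err.args[1])
        raise ValueError(f"dependency loop: {loop}") from err


# Hypothetical service graph, not taken from any real system.
services = {"api": ["db", "cache"], "cache": ["db"], "db": []}
print(cold_start_order(services))
```

If this raises for your real graph, a cold start after a full outage likely needs manual intervention somewhere, which is worth knowing before the outage.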


> So the solution here is to not have that kind of "stack".

Reminds me of my favorite debugging technique. It's super fast: Don't write any bugs!


Quite an effective approach:

https://github.com/kelseyhightower/nocode


Code is a liability. Not writing code is one of the best things a developer can do on any particular day, aside of course from deleting it ;)


What a boring trite reply. All of this is analogous to badly written spaghetti code. And yes, you can absolutely avoid all of this if you know what you're doing.


Not trite at all. Bad code and bad architectures are a reality. You can fix them in theory, but that takes a lot of time and needs to be done incrementally. In the meantime, you have to live with the problem at hand.


> the stack always grows to the point that a dev can no longer test it on their own machine

Sounds like you have a different problem.

CPU resources required to run your stack should be very minimal if it's a single user accessing it for local testing; idle threads don't consume oodles of CPU cycles to do nothing.

Memory use may be significant even in that case (depending on your stack) but let's be realistic. If your stack is so large that it alone requires more memory than a dev machine can spare with an IDE open, the cost of providing developers with capable workstations will pale in comparison to the cost of running the prod environment.

I have a client whose prod environment is 2x load balancer; 2x app server; 3x DB cluster node - all rented virtual machines. We just upgraded to higher spec machines to give headroom over the next couple of years (ie most machines doubled the RAM from the previous generation).

My old workstation bought in 2018 had enough memory that it could virtualise the current prod environment with the same amounts of RAM as prod, and still have 20GB free. My current workstation would have 80+ GB free.

In 95% of cases if you can't run the stack for a single user testing it, on a single physical machine, you're doing something drastically wrong somewhere.


"Most teams/products I have been involved in, the stack always grows to the point that a dev can no longer test it on their own machine"

Isn't this problem solved by CICD? When the developer is ready to test, they make a commit, and the pipeline deploys the code to a dev/test environment. That's how my teams have been doing it.


This turns a 1 hour task into a 1 day task. Fast feedback cycles are critical to software development.

I don't quite understand how people get into the situation where their work can't fit on their workstation. I've worked on huge projects at huge tech companies, and I could run everything on my workstation. I've worked at startups where the CI situation was passing 5% of the time and required 3 hours to run, that you can now run on your workstation in seconds. What you do is fix the stuff that doesn't fit.

The most insidious source of slowness I've encountered is tests that use test databases set to fsync = on. This severely limits parallelism and speed in a way that's difficult to diagnose; you have plenty of CPU and memory available, but the tests just aren't going very fast. (I don't remember how I stumbled upon this insight. I think I must have straced Postgres and been like "ohhhhhhhhh, of course".)
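
For anyone who wants to try this, here is a rough sketch of starting a disposable test Postgres with durability turned off. Only `fsync` comes from the comment above; `synchronous_commit` and `full_page_writes` are additional settings often relaxed alongside it, and the helper name and data directory are placeholders. These settings can corrupt data on a crash, so this is for throwaway test databases only.

```python
def throwaway_postgres_args(data_dir):
    """Build a `postgres` command line for a disposable test database.

    The settings trade crash safety for speed; never use them for data
    you care about.
    """
    settings = {
        "fsync": "off",               # do not force writes out to disk
        "synchronous_commit": "off",  # do not wait for the WAL flush on commit
        "full_page_writes": "off",    # skip full-page images in the WAL
    }
    args = ["postgres", "-D", data_dir]
    for name, value in settings.items():
        args += ["-c", f"{name}={value}"]
    return args


# Placeholder data directory for illustration.
print(" ".join(throwaway_postgres_args("/tmp/test-pgdata")))
```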


It's likely you haven't come across these use cases in your professional career, but I assure you it's very common. My entire career has only seen projects where you need dozens to hundreds of CPUs in order to have a short feedback loop to verify the system works. I saw this in everything from simple algorithms in automotive to Advanced Driver Assistance Systems and machine learning applications.

When you are working on a software project that has 1,000 active developers checking in code daily and require a stable system build you need lots of compute.


There's a lot of folks in startups who think 100 devs is a large org and can't comprehend the scale at which '100% tests pass' stops being a build blocker. I've migrated from such an org to a late stage startup and 'tests must pass' even if fifty engineers are blocked with their PRs and the release train is fully halted. 'But our pipelines must be green' no they don't, at least not all of them.


You just need faster tests. ;)

Also, if you're booting kernel or device drivers you need the hardware. Some of this is not desktop hardware.


When you want to run the whole test suite, yes.

When you're developing and only need to touch 0.1% of the product and 0.001% of the code, that's a total and complete waste of time.


Really? I can't imagine not running the code locally. Honestly, my company has a micro services architecture, and I will just comment out the docker-compose pieces that I am not using. If I am developing/testing a particular component then I will enable it.

How tightly coupled are these systems?
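
One way to answer that for your own stack: compute what actually has to come up for the piece you are working on. A small sketch, assuming the compose file has already been loaded into a dict (the service names and the `services_to_start` helper are hypothetical):

```python
def services_to_start(compose, targets):
    """Return the targets plus everything they transitively depend on."""
    needed, stack = set(), list(targets)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        # depends_on may be a list or a mapping; iterating either
        # yields the dependency names.
        stack.extend(compose["services"][name].get("depends_on", []))
    return needed


# Hypothetical compose contents, already parsed into a dict.
compose = {
    "services": {
        "web": {"depends_on": ["db"]},
        "worker": {"depends_on": ["db", "queue"]},
        "db": {},
        "queue": {},
    }
}
print(sorted(services_to_start(compose, ["web"])))
```

If the result for a single service is nearly the whole file, the systems are tightly coupled.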


The stack(factory) must grow.


I strongly recommend just switching the Dev environment over to Linux and taking advantage of tools like "distrobox" and "toolbx".

https://github.com/89luca89/distrobox

https://containertoolbx.org/

It is sorta like Vagrant, but instead of using virtualbox virtual machines you use podman containers. This way you get to use OCI images for your "dev environment" that integrates directly into your desktop.

https://podman.io/

There are some challenges related to usermode networking for non-root-managed containers, and desktop integration has some additional complications. But besides that it has almost no overhead and you can have unfettered access to things like GPUs.

Also it is usually pretty easy to convert your normal docker or kubernetes containers over to something you can run on your desktop.

Also it is possible to use things like Kubernetes pods definitions to deploy sets of containers with podman and manage it with systemd and such things. So you can have "clouds of containers" that your dev container needs access to locally.

If there is a corporate need for Windows-specific applications then running Windows VMs or doing remote applications over RDP is a possible workaround.

If everything you are targeting as a deployment is going to be Linux-everything then it doesn't make a lot of sense to jump through a bunch of hoops and cause a bunch of headaches just to avoid having it as workstation OS.


If you're doing this, there are many cases where you might as well just spin up a decent Linux server and give your developers accounts on that? With some pretty basic setup everyone can just run their own stuff within their own user account.

You'll run into occasional issues (e.g. if everyone is trying to run default node.js on default port) but with some basic guardrails it feels like it should be OK?
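
As an example of such a guardrail, a sketch that gives each account its own small block of ports derived from its numeric user ID. The base port, block size, and function name are arbitrary assumptions, and UIDs that differ by a multiple of 1000 would collide, so treat it as an illustration rather than a scheme to deploy as-is.

```python
def port_for_user(uid, service_offset=0, base=20000, ports_per_user=10):
    """Map a numeric user ID to a port inside that user's private block.

    Each user gets `ports_per_user` consecutive ports starting at
    base + (uid % 1000) * ports_per_user.
    """
    if not 0 <= service_offset < ports_per_user:
        raise ValueError("service_offset must fall inside the user's block")
    return base + (uid % 1000) * ports_per_user + service_offset


# Two hypothetical accounts on the shared server.
print(port_for_user(1000), port_for_user(1001))
```

A dev server's start script could read the port from this instead of hard-coding the framework default.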

I'm remembering back to when my old company ran a lot of PHP projects. Each user just had their own development environment and their own Apache vhost. They wrote their code and tested it in their own vhost. Then we'd merge to a single separate vhost for further testing.

I am trying to remember anything about what was painful about it but it all basically Just Worked. Everyone had remote access via VPN; the worst case scenario for them was they'd have to work from home with a bit of extra latency.


The painful part of that setup is that all the tools you want to use on the source code must either run on the server itself, and thus be installed there somehow, or work over some slow remote-mounted filesystem. This severely limits the tools you can use.


What tools don't run on Linux? Modern tooling almost assumes Linux in most cases now. As a Windows user I feel like I hit this wall way more often than any other.


Running an IDE over ssh or samba is just an awful experience, but generally if I want a command line tool to be installed I need to be either root or I need to ask the administrator of the server to install it, on my own machine I can install whatever I want and I can run whatever operating system or distro I want.

And if I'm traveling I can bring my laptop with me, can't do that with a server.


Oh right sorry I did forget about that aspect. I haven't done that for a while with a big codebase with, say, VS Code, and not on anything that wasn't very low latency (sub 10ms, so local LAN). I am mostly editing directly on the server when I do it these days but that wouldn't fly for anything other than the light hacking I do.


These days, with many working from remote locations, you also need to include the VPN latency.


This.

Distrobox and podman are such a charm to use, and so easily integrated into dev environments and production environments.

The intentional daemon free concept is so much easier to setup in practice, as there's no fiddly group management necessary anymore.

Just a 5 line systemd service file and that's it. Easy as pie.


OP here. There definitely is a place for running things on your local machine. Exactly as you say: one can get a great deal of consistency using VMs.

One of the benefits of moving away from Kubernetes, to a runner-based architecture, is that we can now seamlessly support cloud-based and local environments (https://www.gitpod.io/blog/introducing-gitpod-desktop).

What's really nice about this is that with this kind of integration there's very little difference in setting up a dev env in the cloud or locally. The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).


> The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).

For example, when you're running on your local machine you've actually got the amount of RAM and CPU advertised :)


"Hm, why does my Go service on a pod with 2.2 cpu's think it has 6k? Oh, it thinks it has the whole cluster. Nice; that is why scheduling has been an issue"

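The mismatch in that quote comes from a process counting the host's CPUs instead of its container quota. A minimal sketch of reading the quota directly, assuming a cgroup v2 container where the limit is exposed at `/sys/fs/cgroup/cpu.max` (the path and both function names are assumptions for illustration):

```python
import os


def cpu_limit_from_cgroup(cpu_max_text):
    """Parse the contents of a cgroup v2 `cpu.max` file.

    The file holds "<quota> <period>", or "max <period>" when unlimited.
    Returns the limit in CPUs (e.g. 2.2), or None if there is no limit.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cpus(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Return the smaller of the host CPU count and the cgroup limit."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(cpu_max_path) as f:
            limit = cpu_limit_from_cgroup(f.read())
    except OSError:
        limit = None  # not in a cgroup v2 container, or file unreadable
    return host_cpus if limit is None else min(host_cpus, limit)


print(cpu_limit_from_cgroup("220000 100000"))
```

Sizing worker pools from something like `effective_cpus()` rather than the raw host count avoids the oversubscription described in the quote.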

Something that's not clear from the post is whether you're running these environments on your own hardware, or layering things on top of something from a cloud provider (AWS, etc)?


Hi Christian. We just deployed Gitpod EKS at our company in NY. Can we get some details on the replacement architecture? I’m sure it’s great but the devil is always in the details.


Need middleware libs that react to eBPF data and signal app code to scale up/down forks in their own memory VM, like V8

Kubernetes is another mess of userspace ops tools. Userspace is for composable UI not backend. Kube and Chef and all those other ops tools are backend functionality being used like UI by leet haxxors


In my last role as a director of engineering at a startup, I found that a project `flake.nix` file (coupled with simply asking people to use https://determinate.systems/posts/determinate-nix-installer/ to install Nix) led to the fastest "new-hire-to-able-to-contribute" time of anything I've seen.

Unfortunately, after a few hires (hand-picked by me), this is what happened:

1) People didn't want to learn Nix, neither did they want to ask me how to make something work with Nix, neither did they tell me they didn't want to learn Nix. In essence, I told them to set the project up with it, which they'd do (and which would be successful, at least initially), but forgot that I also had to sell them on it. In one case, a developer spent all weekend (of HIS time) uninstalling Nix and making things work using the "usual crap" (as I would call it), all because of an issue I could have fixed in probably 5 minutes if he had just reached out to me (which he did not, to my chagrin). The first time I heard them comment their true feelings on it was when I pushed back regarding this because I would have gladly helped... I've mentioned this on various Slacks to get feedback and people have basically said "you either insist on it and say it's the only supported developer-environment-defining framework, or you will lose control over it" /shrug

2) Developers really like to have control over their own machines (but I failed to assume they'd also want this control over the project dependencies, since, after all, I was the one who decided to control mine with the flake.nix in the first place!)

3) At a startup, execution is everything and time is possibly too short (especially if you have kids) to learn new things that aren't simple, even if better... that unfortunately may include Nix.

4) Nix would also be perfect for deployments... except that there is no (to my knowledge) general-purpose, broadly-accepted way to deploy via Nix, except to convert it to a Docker image and deploy that, which (almost) defeats most of the purpose of Nix.

I still believe in Nix but actually trying to use it to "perfectly control" a team's project dependencies (which I will insist it does do, pretty much, better than anything else) has been a mixed bag. And I will still insist that for every 5 minutes spent wrestling with Nix trying to get it to do what you need it to do, you are saving at least an order of magnitude more time spent debugging non-deterministic dependency issues that (as it turns out) were only "accidentally" working in the first place.


People just straight-up don’t want to learn. There are always exceptions, of course, but IME the majority of people in tech are incurious. They want to do their job, and get paid. Reading man pages is sadly not in that list.


> They want to do their job, and get paid

Where "job" is defined in a narrowest way possible to assume minimum responsibility. Still want to get 200k+ salaries though...

This may sound extreme (it really isn't) but as Dr of Eng TP's job was to sus those folks out as early as possible and part ways (the kind where they go work for someone else). Some folks are completely irrational about their setups and no amount of appeasement in the form of "whys" and training is usually sufficient.


This has always made me sad, but I think you're right in a lot of cases. What I've always tried to do is to focus on basic productivity; make sure everyone has everything they need to do their work, and that most people do it in the same way, so you can make progress on the learning journey together. Whenever people ask me for help and want to set up a meeting (not just "please answer this on Slack and I'll leave you alone"), I record the meeting, try to touch on all the related areas of their problem, and then review the recording for things that would be interesting to write about. If any of the digressions are interesting, I go into Notion, create a new page, and write up a couple paragraphs. Then I give my team "ever wonder what dynamic linking is and how to debug it?" and they can read it and know as much as I know.

I really, really struggle to deal with the fact that people don't know as much as I do (I wrote my first program when I was 4 and I'm 39 now), but I have accepted that it's not a weakness on their part, it's a weakness on my part. I wouldn't lower my standards (as a manager once suggested), but I do feel like it's my obligation to lead them on a journey of learning. That is to say, people don't learn without teaching, so be a teacher.


> People just straight-up don’t want to learn. (...) but IME the majority of people in tech are incurious. They want to do their job, and get paid. Reading man pages is sadly not in that list.

I don't think you know what you're talking about. Just because you know people who do not want to waste their time on a set of unproductive chores you arbitrarily singled out, that does not mean they are against learning.

Your take is particularly absurd considering the topic: engineers working on distributed services.

Do you actually believe that you build up enough knowledge on this topic to become a professional in the field if you "straight-up don't want to learn"? There is not a single developer in the field who, at least to some degree, is not self-taught.

> They want to do their job, and get paid.

Everyone wants to get paid. Do you know anyone who works non-profit?

What you're failing to understand is the "do their job" part. Software developers are trained to solve the problems they face, and not waste time with the problems they do not have. Time is precious, and they invest it where it has the largest return on investment.

> Reading man pages is sadly not in that list.

Man pages are notoriously a colossal waste of time. In general they are poorly thought out, they are incomplete, they were written with complete disregard for user experience, and more often than not they are way out of date.

Why do you think sites like Stack overflow is so popular? Because all those "incurious" people in tech feels the need to ask questions and dig through answers on how to solve problems?

I think you're just picking a very personal definition of competence which conveniently boils down to "do the things I do, and do not do the things I don't". Except the bulk of the people in the field is smart, and some have already solved problems that you aren't aware exist, such as wasting precious time deciphering unreadable documents that are systematically out of date.


> Do you actually believe that you build up enough knowledge on [distributed services] to become a professional in the field if you "straight-up don't want to learn"?

Given the modern hiring practice of "can you pass Leetcode," and "can you memorize and regurgitate how to architect a link shortener," yes, yes I do. There is a vast difference between learning to pass a test, and learning because you're sincerely interested in the topic.

> Everyone wants to get paid. Do you know anyone who works non-profit?

Of course we all want to get paid. The intent of the sentence, as I think you know, was that many lack intrinsic motivation, of learning for the sake of learning.

> What you're failing to understand is the "do their job" part. Software developers are trained to solve the problems they face, and not waste time with the problems they do not have.

I think what you're failing to understand is that there is a difference between a factory worker and a craftsman. There is absolutely nothing wrong with factory work, to be clear here – I in no way intend to disparage honest work – I just personally find it a difficult personality to work alongside.

> Time is precious, and they invest it where it has the largest return on investment.

To me, this reads as "be selfish." The fastest way to get an answer is to ask someone who knows. This is not, however, the best way to retain knowledge, nor is it considerate of others' time. That's not to say you shouldn't ask for help, but it's a much different ask when you come to someone saying, "this is what I'm trying to do, this is what I've done, and this has been my result – can you help?"

I can't tell you the number of times someone has DM'd me asking for help on something I've never touched, but by reading docs, have solved. I always try to reinforce that by linking to the docs in the answer, but it hasn't proven to be a successful method of deterring future LMGTFY.

> Man pages are notoriously a colossal waste of time.

Citation needed.

> In general they are poorly thought out

Do you have some specific examples?

> They are incomplete

See above; also, if you've found this to be true, have you considered giving back by updating them?

> They were written with complete disregard for user experience

They were and are written for people who wish to understand their tools, not for people who want a 5 minute Medium post that contains the code necessary to complete a task.

> And more often than not they are way out of date.

I can't think of a time where the man pages _included with a tool_ were out of date. If your system is itself out of date, I can see where this could be true. Again, do you have some specific examples?

> Why do you think sites like Stack overflow is so popular? Because all those "incurious" people in tech feels the need to ask questions and dig through answers on how to solve problems?

SO is a great site, with a dizzying variety of quality in its questions and answers. Take one of (the?) most upvoted answers ever, on branch prediction [0]. The question itself isn't easily answerable via reading docs, and as the answer shows, is surprisingly deep. Next, a highly-upvoted question about how to reset local git commits [1]. This is a question that _is_ easily answerable by reading docs [2]. Or a question on what `__main__` is [3] in Python. A fair question (it is somewhat odd from the outside, especially if you have no experience in Python, have no idea what dunder methods are, etc.), but again, one that's easily answerable by reading docs [4].

> I think you're just picking a very personal definition of competence which conveniently boils down to "do the things I do, and do not do the things I don't".

Of course I think that the way I do things is mostly correct; otherwise why would I be doing them?

> Except the bulk of the people in the field is smart, and some have already solved problems that you aren't aware exist, such as wasting precious time deciphering unreadable documents that are systematically out of date.

Strawman aside, I never said people in tech aren't smart, I said they're largely incurious. Words matter.

[0]: https://stackoverflow.com/a/11227902/4221094

[1]: https://stackoverflow.com/questions/927358/how-do-i-undo-the...

[2]: https://git-scm.com/docs/git-reset#Documentation/git-reset.t...

[3]: https://stackoverflow.com/questions/419163/what-does-if-name...

[4]: https://docs.python.org/3/library/__main__.html


> Given the modern hiring practice of "can you pass Leetcode," and "can you memorize and regurgitate how to architect a link shortener," yes, yes I do.

You are contradicting yourself. If there's anything that requires studying and preparation, that's leetcode.

Also, "memorize and regurgitate how to architect a link shortener" is also known as learning and knowing the basics of systems architecture and software architecture. That's an odd way of criticising others for being more competent than you.


> Of course I think that the way I do things is mostly correct; otherwise why would I be doing them?

"My way is correct" doesn't necessarily imply "other ways are incorrect". Sometimes there's not one single solution to a problem. I love using Linux for my personal machines, but I don't think that people who don't are doing things wrong; they just have different preferences on how to do things, and that's fine.


> Man pages are notoriously a colossal waste of time. In general they are poorly thought out, they are incomplete, they were written with complete disregard for user experience, and more often than not they are way out of date.

Uh, what? What man pages are you reading? I read manpages all the time, and I've never run into an issue where one contained info that was untrue because outdated. The only manpages I've ever read that I'd characterize as incomplete are Apple's.¹

> Why do you think sites like Stack overflow is so popular? Because all those "incurious" people in tech feels the need to ask questions and dig through answers on how to solve problems?

One of the reasons Stack Overflow is so popular is that people who can't/won't read docs can use it to have answers spoonfed to them, often by people who only differ from them in being more willing/able to read the docs. Isn't that extremely obvious?

> unreadable documents

Reading isn't a singular skill— each genre requires its own skills, and you gradually pick those up by reading in that genre. Reading novels doesn't much prepare you to read math textbooks, but that doesn't make all math textbooks 'unreadable'.

The same thing goes for skimming. Skimming a text is likewise a (set of) genre-specific skill(s), built up through practice.

Frankly, moving from your terminal to your web browser to look up how to use a CLI tool is only consistently faster than working with the docs native to that CLI environment (man pages, info pages, usage messages, --help flags, help subcommands, tldr pages, etc.) if you don't have very good reading skills in the genres of those native docs.

As someone who does not have difficulty skimming or navigating manpages quickly, when someone tells me that digging through StackOverflow seems like less of a waste of time than reading docs and so they never read docs, I have to wonder if the real issue is that a reading skills deficit is caught in a self-reinforcing loop.

And indeed, a trip to StackOverflow never ends at StackOverflow for a person with much curiosity. Because even if a curious person finds a solution to their immediate problem, they will wonder things like:

  - is this solution outmoded by some other fix?
  - how is the feature/option/change used in this solution actually supposed to work?
  - are there any alternatives I should know about?
  - if I wanted to do things slightly differently, could I still use the method/feature/option referenced in this solution? does it have any parameters that are easy to swap or tweak?
  - is this scenario what the feature/method/option in the solution is actually intended for? should I care?
... and the quickest way to answer questions like that is usually a glance at a manual.

-----

1: In some cases with GNU stuff the literal `man` pages are abridged versions of the `info` pages. But even then, the `man` pages direct you to `info` pages. It's not like they leave you hanging.


> And indeed, a trip to StackOverflow never ends at StackOverflow for a person with much curiosity.

My favorite variety of SO question is "how do I do X in $LANGUAGE," because inevitably, people pile in with various answers, and then someone starts benchmarking all of them and providing graphs. Occasionally someone even breaks down the assembly instructions for each solution and explains why one is superior to the other. All in all, a fanatical obsession over something small and relatively unimportant, because they like to learn, and they like to share what they've learned.


> Uh, what? What man pages are you reading?

Every single man page out there leads to a user experience that is at best subpar.

> One of the reasons Stack Overflow is so popular is that people who can't/won't read docs can use it to have answers spoonfed to them (...)

Pause and look at what you're saying. Your only criticism of SO is how it improves the task of providing meaningful information to users.

The way you opt to spin improvements to user experience as "spoonfed" speaks volumes of your inability to understand the problem and the value you place on gratuitous ladder-pulling. You even contradict your remarks on man pages.

> Reading isn't a singular skill— each genre requires its own skills (...)

No. Writing is a skill. Producing content that the target audience is able to consume and brings value is a skill. The moment you, as an end-user, feel the need to hunker down and decipher arcane texts is the moment you should realize the documentation is bad.

Again, Stack Overflow is widely used as ad-hoc crowd-sourced documentation for a reason. Some project maintainers even go as far as to make it their own channel to provide technical support. Why so? Do you honestly believe it's because the whole world is not smart enough to read man pages?

Again, those who do not waste their time on man pages are the smart ones who put their own time to better use.


> Your only criticism of SO is how it improves the task of providing meaningful information to users... the value you place on gratuitous ladder-pulling

How is wanting others to learn ladder-pulling? Also, how do you assume people will have this kind of information handed to them when the people who are interested in deeply learning stop doing so, die off, etc.? If you say AI, first of all, best of luck with the hallucinations, but secondly, who is going to work on and train the AI?

> No. Writing is a skill. Producing content that the target audience is able to consume and brings value is a skill. The moment you, as an end-user, feel the need to hunker down and decipher arcane texts is the moment you should realize the documentation is bad.

I think I see the root disagreement here. You continue to mention "value," as though reading is itself not valuable. Sitting down to read a work of fiction arguably brings no value to anyone (except perhaps the author and publisher), yet millions do it anyway. Similarly, if I find a way to do something, I usually want to know if there are also other ways, and if so, if they're better. There's not much "value" there most of the time, but it brings me happiness, and enhances my knowledge of the subject.


> time is possibly too short (especially if you have kids) to learn new things that aren't simple, even if better

Having a kid has drastically altered my ability to learn new things outside of work, simply due to lack of time. I never could have imagined how big of an impact having a kid would be, it's crazy!

The worst thing is when you actually manage to carve out some time to do some learning or experimentation with a new tool, library, etc. only to find out that it sucks, or you just don't have the time to pick it up, or whatever.


> Having a kid has drastically altered my ability to learn new things outside of work, simply due to lack of time. I never could have imagined how big of an impact having a kid would be, it's crazy!

yeah, I could have written this verbatim. Either I was not warned enough, or I did not pay enough attention/heed whatever I was warned of. I don't have a large family, so I've basically had ZERO kid experience since I was a kid... yikes... almost 50 years ago LOL. What worries me though is that it's kind of been an assumption at this job that you DO spend some off-duty time learning/tinkering. And I enjoyed it!

> The worst thing is when you actually manage to carve out some time to do some learning or experimentation with a new tool, library, etc only to find out that it sucks

I got briefly excited about the V language to maybe use for little utility scripts and maybe even as a first teaching language for my kid, then realized that when you scratch the surface of it it's basically kind of ugly underneath. (Example- The "this should never happen" error was literally most of the errors, lol.) It looks like something with a lot of great ideas but slipshod not-deeply-thought-out implementation. And the final nail in the coffin was all the evidence that the language creator simply bans anyone with valid criticism- I'm a free-speech near-absolutist so that one was the killer for me.

One of a few examples of what you're referring to. The thing is, before kids, we could afford to waste that time. Now we cannot. :/


From my perspective, installing Nix seems pretty invasive. I can understand if someone doesn't want to mess with their system "unnecessarily", especially if the tool and its workings are foreign. And I can't really remember the last time I had issues with non-deterministic dependencies either. Dependency versions are locked. Maybe I'm missing something?


> installing Nix seems pretty invasive

I'm writing a test to check whether a tool I'm writing can work without Nix (it works with it perfectly, but I want it to also work without it because there are a lot of folks like you, and like me about 3 years ago, who still think they'd rather struggle with manually installing the right glibc that goes with the right python dependency installed with the right pip and venv versions, to the right location, that goes with the right python version that makes Whisper models work (literally the thing I'm currently working on), instead of just running `nix develop`, getting a coffee, and being done).

And all I have to do to simulate "no Nix" is to remove all the nix paths from PATH (I suppose I could purge it from the linker paths as well, now that I think about it). But that's it.
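A rough sketch of that "no Nix" simulation for a test harness, in Python. The function names and the list of markers (`/nix/`, `/.nix-profile`) are assumptions for illustration, not anything Nix ships; adjust them to wherever Nix actually shows up in your PATH.

```python
import os


# Hypothetical markers for PATH entries that belong to Nix; adjust as needed.
NIX_MARKERS = ("/nix/", "/.nix-profile")


def strip_nix_from_path(path: str, sep: str = os.pathsep) -> str:
    """Return `path` with every entry that looks Nix-owned removed."""
    kept = [
        entry
        for entry in path.split(sep)
        # Append a slash so "/nix" itself and "/nix/store/..." both match "/nix/".
        if not any(marker in entry + "/" for marker in NIX_MARKERS)
    ]
    return sep.join(kept)


def env_without_nix() -> dict:
    """Copy of the current environment with Nix entries removed from PATH."""
    env = dict(os.environ)
    env["PATH"] = strip_nix_from_path(env.get("PATH", ""))
    return env
```

The resulting dict can be passed as `env=` to `subprocess.run` so the tool under test only sees non-Nix binaries. It does not touch linker paths, which would need separate handling.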

What Nix does is put its entire repo into a separate part of your hard drive owned by root, and create a few build users for security reasons. That's (to me) not particularly "invasive," but YMMV (and if you use the Determinate Nix installer, it's even more trivial to uninstall than the official way). Also, when you run `nix develop`, the environment changes it does to make everything "just work" (like PATH changes etc) are only valid for that terminal session. Again, this is the least intrusive thing possible while also providing the guarantees it does, and is also (more or less) guaranteed to work.

The Nix whitepaper is pretty readable and not that long. I recommend it to understand why it's important and useful: https://edolstra.github.io/pubs/nspfssd-lisa2004-final.pdf

There is also Guix, which is like Nix but uses Guile (a Scheme dialect) as its scripting language all the way down to the bare metal (literally, the boot loader is written in it, I believe, as soon as the interpreter is loaded somehow). Their strategy seems to be to let Nix take the lead and make all the mistakes and then implement the way that seems to work the best, in its own ecosystem/tooling: https://guix.gnu.org/ But they have a lot fewer packages than Nix does.

Both of these let you define an entire machine with a single configuration file that is far more guaranteed to work than running a Dockerfile.


Guix uses GRUB like most distros (but is scripted a bit to allow booting earlier generations of the system). The init system is Guile (Shepherd).

Fewer packages, yes, but the packages are by far the most common ones. It's easy to add packages for yourself, if needed. Nonguix channel and others for stuff upstream won't accept.

I believe Guix is innovating on a number of things in relation to Nix these days. So I've heard. I don't know much about Nix, honestly.

I heartily agree with your last paragraph in the case of Guix.


Thanks for providing some concrete examples and explaining how it works. Sounds like a reasonable setup to automate. I'm actually running NixOS at home on a few hosts (but not on my desktop), so Nix isn't entirely foreign to me, but I haven't (yet?) used it with other distros or for developing as my needs have been simpler. In your case, I probably would have chosen the recommended path.

In my current work project, we use Windows and .NET with some libs and tools. Nothing too complicated, but automating it would be nice. I could probably push for it a bit, but I'm not familiar with automating Windows environments, since I mainly use Linux at home.


Nix on MacOS makes a bunch of build users and has a bad habit of touching files that then get touched by MacOS updates, leading to a system that works... then doesn't work.


The only such case I’ve seen was the early betas of Sequoia requiring an extra flag to use higher UIDs for build users.


The Determinate Nix installer flavor is supposedly more resilient to this. Have survived at least 1 OS update unscathed with it.


Typically when you start a new dev job the company will provide you with a pre-provisioned laptop that has their security stuff set up and maybe dev tools already installed, e.g. source code, compilers, VMs, Nix, and a supported editor, so it's not exactly a personal machine that they're messing with.


Sure, and personally, I have no issue installing and using recommended tools since it's indeed just a work laptop, but I've seen more tuned setups too (see parent's point 2: "Developers really like to have control over their own machines")


> From my perspective, installing Nix seems pretty invasive.

How so? With what other software does Nix interfere?


From the provided link:

> Nix requires a broad set of changes to your system, from creating new users to installing and running a daemon to creating a root volume and beyond


The daemon and the users are only to run a build when that is necessary. It might seem invasive but it works out pretty well.


Ok, I see what you mean now.

What the post is trying to do there is motivate the creation of a new installer, including to the existing Nix community. The snippet you've highlighted is essentially correct, but I still wouldn't characterize Nix as particularly invasive.

The only thing that Nix strictly needs is to be plugged into your shell. That's it. It doesn't need deep or special hooks into a system just to function.

But including the daemon enables sandboxing for builds that Nix performs, which improves both the security and isolation of those builds, and it also lets Nix be shared nicely between unprivileged users on multiuser systems. For those reasons, daemonful installs are the default and with them come the system users.

(Adding system users is pretty much bog standard stuff for Unix system software, since the main kind of security boundary designed into that system is boundaries between users. Indeed, that's exactly what that's used for with Nix, too.)

The two things I described above comprise the totality of what is required to enable all of Nix's functionality. Everything else that the Determinate Nix installer does as of now is to work around or avoid macOS quirks, and is totally unnecessary for using Nix on any other OS.

The 'root volume' stuff is the result of a collision between the historical and conventional location of the Nix store at `/nix` and Apple's later imposition of a read-only root partition. So Nix installers do a little Apple-specific dance that creates a kind of filesystem volume that doesn't take up any real space or involve any physical partitioning of the disk when they run on macOS.

The other thing this installer does is build in an attempt to self-repair the damage that Apple inflicts upon Nix's sole real requirement by having macOS unconditionally clobber the shell config files under /etc during major macOS updates.

That's it. That's an exhaustive list of all the things a Nix installer does and why. It's not particularly tricky, or hard to remember or figure out. It's not even hard to undo manually— before the Determinate Nix installer existed, I sometimes uninstalled Nix by hand while manually testing the macOS bootstrap scripts for my dotfiles. It was annoying to do, and the uninstallation functionality of the Determinate Nix installer is extremely reliable and convenient and nice. But anyone who knows what `$PATH` is and has ever run `man` before could completely uninstall Nix even if some joker walked over to their machine and deleted the uninstaller.

At the same time, none of the changes Nix installers make on your system affect the behavior of outside programs at all, except by exposing what you choose to install via Nix through standard Unix environment variables like PATH.

Lacking things like kernel components, automatic self-updates, or the requirement for privileged APIs (e.g., on macOS, the endpoint security APIs and accessibility APIs), Nix is not only far less invasive than any endpoint security software, monitoring software, or MDM software you are likely to run on a work machine, but, I'd argue, also less invasive than tons of common desktop software like Zoom, Discord, and DisplayLink, and tons of popular macOS powertools like Amphetamine, SteerMouse, SoundSource, etc.

Plus the uninstall procedure with the DetSys installer and its forks is totally conventional and leaves nothing behind: run uninstaller, thing gone.

Nix on macOS is admittedly not an installer-free, drag-and-drop app bundle like some lovely applications get to be. But at most workplaces it's not likely to crack the top 10 most invasive applications installed on the average developer machine, either. Nix installers are just very up front about the things they do set up.

All that said, there are reasonable people who find having a daemon at all offensive. People who are deeply committed to minimalism or simplicity might prefer a single-user install or to use some other tool. But I think for most people, Nix is more than fine in terms of invasiveness.


I trialed for a job where the CTO was convinced of dev environments in kube as "the way to work". Everyone else was at least ambivalent. I joined, tried to make some changes that would let me run things locally. Every time I got pushback about using the dev environments instead.

It took me a couple of days to get a supervisor-based setup working locally. I was the only person on the team who would run the backend and frontend when trying things out, because nobody was actually using the dev environments fully anyways. There was no buy-in for the dev environment!

I really feel like if you are in a position to determine tooling, it's so much more helpful to lean into whatever people on the ground want to use. Obviously there are times when the people on the ground don't care, but if you're spending your sweat and tears to put the square peg into the square hole suddenly you're the person with superpowers, and not the person pushing their pet project.

And sometimes that's just "wrap my thing with your thing".


> I really feel like if you are in a position to determine tooling, it's so much more helpful to lean into whatever people on the ground want to use.

This might mean picking something that you think/know kind of sucks for the task, but that will be easier for most people to grok - while it might subjectively feel unfortunate, it's probably the right thing to do, for the sake of the majority of the team having an easier time.

Pushing your interests more strongly, or even in a top down fashion, might work, but that's more risky both in regards to letting everyone get things done, as well as team cohesion and turnover.


I think this, or something of equal complexity, is probably the right choice. I have spent a lot of time helping people with their dev environments, and the same problems keep coming up; "no, you need this version of kubectl", "no, you need this version of jq", "no, the Makefile expects THIS version of The Silver Searcher". A mass of shell scripts and random utilities was a consistent drag on the entire team and everyone that interacted with the team.

I ended up going with Bazel, not because of this particular problem alone (though it was part of it; people we hired spent WEEKS trying to get a happy edit/test/debug cycle going), but because proper dependency-based test caching was sorely needed. Using Bazel and Buildbuddy brought CI down from about 17 minutes per run to 3-4 minutes for a typical change, which meant that even if people didn't want to get a local setup going, they could at least be slightly productive. I also made sure that every dependency / tool useful for developing the product was versioned in the repository, so if something needs `psql` you can `bazel run //tools/postgres/psql` and have it just work. (Hate that Postgres can't be statically linked, though.)
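For anyone unfamiliar with what "dependency-based test caching" means in practice, here is a toy sketch in Python. It is a simplified model of the idea, not Bazel's implementation: every name here (`input_digest`, `run_tests`, the `deps`/`contents` dicts) is hypothetical. A test's cache key is a hash over its own content plus everything it transitively depends on, so a test only reruns when one of those inputs changes.

```python
import hashlib


def input_digest(target: str, deps: dict, contents: dict) -> str:
    """Hash a target's own content together with all transitive dependencies."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, []))
    h = hashlib.sha256()
    # Sorted so the digest does not depend on traversal order.
    for node in sorted(seen):
        h.update(node.encode())
        h.update(b"\0")
        h.update(contents[node].encode())
        h.update(b"\0")
    return h.hexdigest()


def run_tests(targets, deps, contents, cache, runner):
    """Run only the tests whose input digest is not already in the cache."""
    executed = []
    for t in targets:
        key = input_digest(t, deps, contents)
        if key not in cache:
            cache[key] = runner(t)
            executed.append(t)
    return executed
```

With two tests where only one depends on a changed library, the second call to `run_tests` executes just that one test; the other result is served from the cache.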

It was a lot of work for me, and people do gripe about some things ("I liked `go test ./...`, I can't adjust to `bazel test ...`"), but all in all, it does work well. I would do it again. Day 1 at the company; git clone our thing, install bazelisk, and your environment setup is done. All the tests pass. You can run the app locally with a simple `bazel run`. I'm pretty happy with the outcome.

Nix is something I looked into for our container images, but they just end up being too big. I never figured out why; I think a lot of things are dynamically linked and they include their own /usr/lib tree with the entire transitive dependency chain for that particular app, even if other things you have installed have some overlap with that dependency chain. I prefer the approach of statically linking everything and only including what you need. I compromised by basing things on Debian and rules_distroless, which at least lets you build a container image with the exact same sha256 on two different machines. (We previously just did "FROM scratch; COPY <statically linked binary> /app; ENTRYPOINT /app", but then started needing things like pg_dump in our image. If you can just have a single statically-linked binary be your entire app, great. Sometimes you can't, and then you need some sort of reasonable solution. Also everything ends up growing a dependency on ca-certificates...)


> I think a lot of things are dynamically linked and they include their own /usr/lib tree with the entire transitive dependency chain for that particular app, even if other things you have installed have some overlap with that dependency chain. I prefer the approach of statically linking everything and only including what you need.

Wherever there's such overlap, those dependencies are already shared. Static linking in such a situation means more disk usage, not less.

Packages in Nixpkgs have large closure sizes for entirely other reasons, like not splitting packages as aggressively as they could be split, or enabling/including most optional dependencies by default. Distros like Alpine typically lean the other way for their defaults.

It's true that if you're willing to manually mangle them, static binaries are nice because you can very easily strip all docs and examples or even executables that you don't need, and still know your executable will have the libs it's linked against. In one place at work I actually do this with Nixpkgs— there's a pkgsStatic that includes only statically compiled packages. I pull just the tiny parts of some package I need out and copy them onto a blank OCI image because it was the path of least resistance.

But Nix also has some really nice tools for inspecting dependency graphs to figure out why large packages are getting pulled in. nix-tree is my favorite, but there's also the older nix-du that gives the same info via graphviz instead of the terminal, and the built-in `nix why-depends`.

-----

Edited to add: wait, are you saying you used some other base distro to create Docker images where some things were supplied by Nix and others came from the base distro? If so, yeah, Nix is going to bring all the dependencies along, all the way down to libc or whatever. That's required for the kind of hermeticity that is its goal.

Mixed images like that are always going to be larger. But you also don't need a base distro at all with Nix. You can use one of the existing Nix libraries for Docker/OCI stuff to generate a complete image from scratch, or just copy your Nix packages' dependency closure onto an empty image with a FROM SCRATCH Dockerfile.

If you can't do that, you can do various things to try to slim things down but it's best to just Nixify whatever other packages you're using so you don't need a base distro. (And if you're trying to save space, Nix itself doesn't need to be in your Docker images either, which can also cut out some deps.)


Just to answer your edit, nope, I wasn't mixing distros or anything. It would either be all Nix or all Debian.


Got it. Yeah, that would likely be easiest to resolve by exploring the closure with nix-tree and looking at how to eliminate or split dependencies. But pkgsStatic can be a handy fallback option too.


I think if you take about 80% of your comment and replace "Nix" with "Haskell/Lisp" and a few other techs, you'd basically have the same thing. Especially point #1.


Too true. I think there's a lot of people who don't want control; freedom is responsibility, as the saying goes, and responsibility can be stressful, even if it's liberating also.

Worse is better, sadly.


After my personal 2-year experiment with NixOS, I'd avoid anything Nix like the plague, and would be looking for a new job if anyone instituted a Nix-only policy.

It's not the learning new things that's a problem, but rather the fact that every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix. And that's because the feedback loop and general UX is just awful - I really started to feel like I needed a sacrificial chicken.

Docker may be a dumpster fire, but at least it's generally easy to see what you did wrong and fix it.


Literally every Docker image you've ever downloaded comes from a build that only worked (essentially) "by accident."

If Docker builds were as deterministic as Nix, then all that would need to be distributed would be Dockerfiles and perhaps a cache of base images somewhere.

Looking at a build as a pure function where each dependency (including any compiler(s), plus the environment), are "input arguments" to it, was a revelation (since I already realized the advantages of pure functions while working in functional languages).

Running a Dockerfile and hoping to get a working image out of it is like running a function which checks the time when it runs and errors when the seconds end in 0 due to a bug.
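To make the analogy concrete, here is a toy sketch in Python. Neither function is how Nix or Docker actually builds anything; both names are made up for illustration. The first treats every input as an explicit argument, so the same inputs always give the same output. The second also reads the clock, an input nobody declared, and fails whenever the seconds end in 0.

```python
import hashlib
from datetime import datetime
from typing import Optional


def pure_build(source: str, compiler_version: str, env: dict) -> str:
    """Toy 'build': the output depends only on the declared inputs."""
    h = hashlib.sha256()
    h.update(source.encode())
    h.update(compiler_version.encode())
    # Sort the environment so dict ordering cannot change the result.
    for key in sorted(env):
        h.update(f"{key}={env[key]}".encode())
    return h.hexdigest()


def flaky_build(source: str, now: Optional[datetime] = None) -> str:
    """Toy impure 'build': also reads the clock, an undeclared input."""
    now = now or datetime.now()
    if now.second % 10 == 0:
        # The deliberate 'bug' from the analogy: fail when seconds end in 0.
        raise RuntimeError("build failed for reasons unrelated to its inputs")
    return hashlib.sha256(source.encode()).hexdigest()
```

Calling `pure_build` twice with the same arguments always returns the same digest, while `flaky_build` succeeds or fails depending on when you happen to run it.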

> every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix

There is spotty education in the space. Did you ever take this (very cool) Nix tutorial? Not actually understanding Nix is going to make any troubleshooting of Nix much harder. https://nixcloud.io/tour/

> I really started to feel like I needed a sacrificial chicken.

Have you looked at Guix? A lot of people think it's "Nix without the warts." Plus it uses a Lisp, which some people prefer, or can at least grok better than the Nix language. https://guix.gnu.org/


I watched all the videos, read all the tutorials (including the "super easy" ones). I understand the Nix language, but the actual infrastructure is just so damn convoluted and fragile, the error feedback mechanisms are worse than useless, and the documentation assumes that you already know everything inside and out. It reminds me of the bad old days of the neckbeard gatekeepers.

I'll take a look at guix, though...


"Working by accident" is still "working" though, so from the standpoint of the person using it, the reason doesn't matter. I'd also argue that the main point made above about docker wasn't that it's reliable, but that it gives developers more feedback when trying to work with it. For people whose goal isn't just to consume builds but write their own, existing Nix builds being deterministic does nothing to help them if trying to modify or borrow parts of them leads to issues they can't easily fix.

I think Nix fits a pattern that's happened in plenty of other domains where the technology that focused on doing things "right" failed to win out against a competitor doing things "wrong" but optimizing for a lower barrier of entry. The logic that a perfect solution is worth an up-front cost is compelling, since having an imperfect solution has a long-term cost that never goes away, but this misses the fact that pushing the cost until later has value of its own; making things easier today at the cost of tomorrow buys time to improve things before the cost is incurred.

At the risk of a convoluted metaphor, imagine that someone moves into a new house and calls two plumbers and asks them to hook up the water in their bathroom. The first plumber says that they can get it done so the bathroom can be used today, but they'll need to come by again in a week or two since they might need to make additional adjustments. The second plumber says they've come up with a way to make sure that they never need to come back to make adjustments, but it will take them a full week to finish setting it up before anyone can use it.

For most people, it doesn't matter if the second plumber's solution will be better next week if they need to use the bathroom today, as long as the first plumber's solution can last long enough before it needs to be fixed.


> ...but at least it's generally easy to see what you did wrong and fix it.

You're not actually "fixing" anything, you're just passing the ball of shit down the responsibility chain to the ops/infra team.

Which is fine if you work in a large corporation where this is a valid strategy.

Unfortunately though the software supply chain problem is a) very difficult and b) unavoidable.

Nix is the best (or maybe only) attempt to solve this problem with programmatic (vs organizational) tooling.


I've got nothing against the fundamental concepts that Nix strives for. In fact, that's what triggered my 2 year journey with it. I just hate the implementation with a passion. The overall result is worse than before.


There is no possible way to solve anything in this problem space without triggering developer PTSD. The implementation is not at fault.

(See npm or the clusterfuck of Python packaging for proof.)


I love the idea/design of Nix and hope that one day someone will reimplement it in a way that one can reasonably understand/debug. Language is part of the problem but I think it's more of a https://www.lihaoyi.com/post/SowhatswrongwithSBT.html style problem where the execution model involves too much at runtime.


> you either insist on it and say it's the only supported developer-environment-defining framework, or you will lose control over it

That's true for any architectural decision in an organization with more than 1 person.

It's really not something that should make you reconsider a decision. At the end of the day, an architecture that "people" actually want to use doesn't exist; "people" don't want any singular thing.


Try Devbox, you can basically ignore nix entirely and reap all the benefits.


When I looked at it, I encountered some problem that unfortunately slips my mind now.


We have issues with it from time to time, it's under fairly active development, give it a try.


In a worse world, worse is better.


this is such an awful truth to even say aloud lol but... yeah.

which is why you gotta find your peeps that believe in the better world and are thus believing it into existence. (OT: reminds me of this song title: https://soundcloud.com/anjunabeats/mat-zo-see-it-when-i-beli...)


> there is no (to my knowledge) general-purpose, broadly-accepted way to deploy via Nix

`nix copy .#my-crap --to ssh://remote`

What you do with it then on the remote depends on your environment. At the minimum do a `nix-store --add-root` to make a symlink to whatever you just copied.

(The most painless path is if you're deploying an entire NixOS system, but that requires converting the remote host to NixOS first.)


Heh, yeah. You gotta put in writing that only userlands defined in Nix will be eligible to enter any environment beyond "dev". And (also put in writing) that their performance in the role will be partly evaluated on their ability to reach out for help with Nix when they need it.


Hello. Currently debugging my kubernetes-based dev pod and not getting anything else done. What fun!


> how a developer manages assets like source code

IMO there are some workloads, where it is beneficial for a developer to have access to a local repository with at least some snippets based on previous projects.

Having a leftover PoC of some concept written for a previous employer but never elevated to team use/production is both handy (at least to confirm that the build environment is still viable after an unspecified period of toolchain updates) and ethical (copying production code is not ethical - even if the old and new products are vastly different e.g. last job was taxi app, new app is banking app).

Making it all 'remote' and 'cloud' will eventually result in a bike reinvention penalty on each new employment - not everything can be rebuilt from memory only, especially things that are done 1-2 times a year; sure, there is open-source documentation/examples, but at some point it'll just introduce an even heavier penalty: you either need to know a lot of open-source stuff to have some reference points, or to work on pet projects to get the same amount of references.


Are you suggesting that you should enable the employee to move work done on company time and that is the company’s IP to a new company?

And the new company would also be liable for using trade secrets that they shouldn’t.


Neither, it's unethical and there's no possibility of doing that in a legal way.

However I do write 1-2 hour PoCs in my spare time and on my own equipment, using only publicly available stuff - they sometimes come in handy at some point later. If we assume 'remote first' development is okay - with no possibility to test stuff locally, well, we're back to either bookmark managers or pet projects to keep at least a bit of knowledge between jobs.


> The only real downside is data control (ie - the company has less control over how a developer manages assets like source code).

I've worked in a remote, secured development environment and it sucked, but to their credit the company did it for exactly this reason - control over the source. But bear in mind that source control is a two-way street.

Losing proprietary source can be harmful (especially in compiled languages where the source might carry much more information than the distributable). But they were mostly worried about the opposite way...that something malicious gets INTO the source which could pose an existential threat. You'd be correct to say "well that should be the domain of source control, peer review etc", but in this case the company assessed the risk high enough to do both.


I've no problem with remote dev envs for most things. But they have to be VMs in many cases, not containers.


I think nowadays source code is rarely a more valuable asset than the data being processed. Also I would prefer to give my devs just a second machine to run workloads and eventually pull in data or mock the data so they get moving more easily.


Sounds like you are not using a lot of hardware - RFID, POS, top-spec video cards, etc.


> The only real downside is data control (ie - the company has less control over how a developer manages assets like source code). In my experience, the vast majority of companies should worry less about this [...]

I once had to burn a ton of political capital (including some on credit), because someone who didn't understand software thought that cutting-edge tech startup software developers, even including systems programmers working close to metal, could work effectively using only virtual remote desktops... with a terrible VM configuration... from servers literally halfway around the world... through a very dodgy firewall and VPN... of 10Mb/s total bandwidth... for the entire office of dozens of developers.

(And no other Internet access from the VMs. Administrators would copy whatever files from the Internet that are needed for work. And there was a bureaucratic form for a human process, if you wanted to request any code/data to go in or out. And the laptops/workstations used only as thin-clients for the remote VMs would have to be Windows and run this ridiculous obscure 'endpoint security' software that had changed hands from its ancient developer, and hadn't even updated the marketing materials (e.g., a top bulletpoint was keeping your employees from wasting time on a Web site that famously was wiped out over a decade earlier), and presumably was littered with introduced vulnerabilities and instabilities.)

Note that this was not something like DoD, nor HIPAA, nor finance. Just cutting-edge tech on which (ironically) we wanted first-mover advantage.

This escalated to the other top-titled software engineer and me together giving a presentation to the C-suite on why this would not only kill working productivity (especially in a startup that needed to do creative work fast!), but also fail at its goal: the bad actors someone was paranoid about could easily circumvent it anyway to exfiltrate data (using methods obvious to skilled software people like the ones they hired, some undetectable by any security product or even human monitoring they imagined), and all the good rule-following people would quit in incredulous frustration.

Unfortunately, it might not have been even the CEO's call, but a crazy investor.


That's fine for some. However, that's not always the case. I wrote an entire site on my iPad in my spare time with Gitpod. Maybe you are at a small company with a small team, so if things get critical you are likely to get a call. Do you say F' it, do you carry your laptop, or do you carry your iPad like you already do, knowing you can still at least do triage if needed because you have a perfectly configured Gitpod workspace to use?


If your app fits on one machine, I agree with you: you absolutely should not use cloud dev environments in my opinion (and I've worked on large dev infra teams, that shipped cloud dev environments). The performance and latency of a Macbook Pro (or Framework 13, or whatever) is going to destroy cloud perf for development purposes.

If it doesn't fit on one machine, though, you don't have another option: Meta, for example, will never have a local dev env for Instagram or Blue. Then you need to make some hard choices.

Personally, my ideal cloud dev env is:

1. Local checkout of the code you're working on. You can use whatever IDE or text editor you prefer. For large monorepos, you'll need some special tooling to make sure it's easy to only check out slices of the repo.

2. Sync the code to the remote execution environment automatically, with hot-reloading.

3. Auto-port-forward from your local machine to the remote.

4. Optionally be able to run dependent services on your personal remote to debug/test their interactions with each other, and optionally be able to connect to a well-maintained shared environment for dependencies you aren't working on. If you have a shared environment, it can't be viewed as less-important than production: if it's broken, it's a SEV and the team that broke it needs to drop everything and fix it immediately. (Otherwise the shared env will be broken all the time, and your shipping speed will either drop, or you'll constantly be shipping bugs to prod due to lack of dev care.)

At Meta we didn't have (1): everyone had to use VSCode, with special in-house plugins that synced to the remote environment. It was okay but honestly a little soul-sucking; I think customizing your tooling is part of a lot of people's craft and helps maintain their flow state. Thankfully we had the rest, so it was tolerable if not enjoyable. At Airbnb we didn't have the political will to enforce (4), so the dev env was always broken. I think (4) is actually the most critical part: it doesn't matter how good the rest of it is, if the org doesn't care about it working.

But yeah — if you don't need it, that's a lot of work and politics. Use local environments as long as you possibly can.


> Personally - just let the developer own the machine they use for development.

It'll work if the company can offer something similar to EC2. Unfortunately, most companies are not capable of doing so if they are not on the cloud.


I’m not sure we should leap from:

> I have seen several attempts to move dev environments to a remote host. They invariably suck.

To “therefore they will always suck and have no benefits and nobody should ever use them ever”. Apologies for the hyperbole but I’m making a point that comments like these tend to shut down interesting explorations of the state of the art of remote computing and what the pros/cons are.

Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.


But they don't suck because of lack of effort - they suck because there are real physical constraints.

Ex - even on a VERY good connection, RTT on the network is going to exceed your frame latency for a computer sitting in front of you (before we even get into the latency of the actual frame rendering of that remote computer). There's just not a solution for "make the light go faster".
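To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes signals in optical fiber travel at about two thirds of the speed of light in vacuum and it ignores routing, queuing and processing entirely, so it only gives a physical floor. The function names are made up for illustration.

```python
# Rough lower bound on network round-trip time versus the time budget of one display frame.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: signals in optical fiber travel at about 2/3 c


def min_rtt_ms(distance_km: float) -> float:
    """Physical floor for a round trip over fiber; ignores routing, queuing and processing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


def frame_time_ms(refresh_hz: float) -> float:
    """Time budget of a single frame at the given refresh rate."""
    return 1000 / refresh_hz


if __name__ == "__main__":
    budget = frame_time_ms(60)
    for km in (100, 1_000, 5_000):
        rtt = min_rtt_ms(km)
        verdict = "exceeds" if rtt > budget else "fits within"
        print(f"{km:>5} km: >= {rtt:.1f} ms RTT, {verdict} one 60 Hz frame ({budget:.1f} ms)")
```

Under these assumptions the floor alone passes one 60 Hz frame at roughly 1,700 km of fiber. Measured RTTs sit above that floor, so the practical break-even distance is shorter.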

Then we get into the issues the author actually laid out quite compellingly - Shared resources are unpredictable. Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?

And that's before we even touch the "My machine is down/unreachable, I don't know why and I have no visibility into resolving the issue, when was my last commit again?" style problems...

> Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.

And this... is just bogus. We're not talking about machines running production data. We're talking about a developer environment. Sure - limit access to prod machines all you like, while you're at it, don't give me any production user data either - I sure as hell don't want it for local dev. What I do want is a fast system that I control so that I can actually tweak it as needed to develop and debug the system - it is almost impossible to give a developer "the least access needed" to do development locally because if you know what that access was you wouldn't be developing still.


> But they don't suck because of lack of effort - they suck because there are real physical constraints.

They do suck due to lack of effort or investment. FANG companies have remote dev experiences that are decent - or even great - because they invest obscene amounts into dev tooling.

There are physical constraints on the flip side too: especially for gigantic codebases or datasets that don't fit on dev laptops, or that need lower latency to other services in the DC.

Added bonus: smaller attack surface area for adversaries who want to gain access to your code.


It isn't just the tooling though.

At least with Google, they also have a data center near where most developers work, so that they have much lower latency.

They can't make the light go faster, but they can make it so it doesn't go as far. Smaller companies usually don't have a lot of flexibility with that though.


> even on a VERY good connection, RTT on the network is going to exceed your frame latency for a computer sitting in front of you (before we even get into the latency of the actual frame rendering of that remote computer). There's just not a solution for "make the light go faster".

Are you imagining the implementation as some kind of Remote Desktop setup, where no software runs on the local machine (except the Remote Desktop, of course)? This is not the state of the art for using remote developer machines: Typically some editor/IDE components run locally, for example.

> Shared resources are unpredictable.

Then don’t share them! We should do something akin to physically moving your local computer into the data center next to the servers.

> Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?

If the software you’re developing is going to run in a shared environment then it’s better you experience these issues while developing otherwise you’re asking for a lot of “works on my machine” problems.

> And that's before we even touch the "My machine is down/unreachable

This seems less about remote and more about stability/software design?

> I don't know why and I have no visibility into resolving the issue…

This seems more about observability/software design?

> We're not talking about machines running production data.

The source code itself is something that a company should protect for multiple reasons, not the least of which is preventing an attacker from reading your source to find exploits. There are also various legal and compliance reasons for limiting the distribution of source as much as possible.

> What I do want is a fast system that I control so that I can actually tweak it as needed to develop and debug the system

I don’t understand why this is impossible on a remote machine. Can you elaborate?

> it is almost impossible to give a developer "the least access needed" to do development locally because if you know what that access was you wouldn't be developing still.

I’m sure we can imagine all sorts of setups, from free-for-all root access to so locked down it’s impossible to do work. The sweet spot is typically “you can sudo within reason but we’re logging your activity.”


> Personally - just let the developer own the machine they use for development.

I wonder if Microsoft's approach for Dev Box is the right one.


Could you elaborate on what that approach is?


laughs in "Here's a VDI with 2vCPUs and 32GB of RAM but the cluster is overloaded, also you get to budget which IDEs you have installed because you have only a couple hundred GB of storage for everything including what we install on the base image that you will never use"


> Personally - just let the developer own the machine they use for development.

Overall I agree with you that this is how it should be, but as DevOps working with so many development teams, I can tell you that too many developers know a language or two but beyond that barely know how to use a computer. Most developers (yes even most of the ones in Silicon Valley or the larger Bay Area) with Macbooks will smile and nod when you tell them that Docker Desktop runs a virtual machine to run a copy of Linux to run OCI images, and then not too much later reveal themselves to have been clueless.

Commenters on this site are generally expected to be in a different category. Just wanted to share that, as a seasoned DevOps pro, I can tell you it's pretty rough out there.


This is an unpopular take, but entirely true. Skill with a programming language, other than maybe C, does not in any way translate to general skill with system administration, or even knowing how to correctly operate a computer. I once had to explain to a dev that their Mac was out of disk space because (a) they had never removed dangling containers or old image versions, and (b) they had never emptied the Trash.


Unpopular take? I thought it was common knowledge that computer operation and computer programming involve distinct (if overlapping) skillsets.


It's unpopular because developers want to have root access to their machines whilst being proudly ignorant of how fast the sensitive medical data or financial data they're working on can fly out of the machine.

Even when provided a means to instantiate virtual machines where they can have root access within the virtual machine, a lot of them will bitch.


> Even when provided a means to instantiate virtual machines where they can have root access within the virtual machine, a lot of them will bitch.

Well, yeah. I spent a year or so doing all my work in a VM (for other reasons) and it sucked.

> proudly ignorant to how fast the sensitive medical data or financial data they're working on can fly out of the machine

Hey, this is an easy choice! If I can have local root XOR sensitive production data on my machine, I pick local root. Keep that PII the fuck away from my disk, please!! (Hell, do that whether I have root or not.)


Sometimes I don't even use virtual envs when developing locally in Python. I just install everything that I need with pip --user and be done with it. Never had any conflicts with system packages whatsoever. If I somehow break my --user environment, I simply delete it and start again. Never had any major version mismatch in dependencies between my machine and what was running in production. At least not anything that would impact the actual task that I was working on.

I'm not recommending this as a best practice. I just believe that we, as developers, end up creating some myths to ourselves of what works and what doesn't. It's good to re-evaluate these beliefs now and then.


When doing this re-evaluation, please consider that others might be quietly working very hard to discover and recreate locally whatever secret sauce you and production share.


The only time I’ve had version issues running python code is that someone prior was referencing a deprecated library API or using an obscure package that shouldn’t see the light of day in a long lived project.

If you stick to the tried and true libs and change your function kwargs or method names when getting warnings, then I’ve had pretty rock steady reproducibility using even an un-versioned “python -m pip install -r requirements.txt” experience

I could also be a slob or just not working at the bleeding edge of python lib deployment tho so take it with a grain of salt.


I'm not going to second-guess what works for you, but Python makes it so easy to work with an ephemeral environment.

  python -m venv .venv


Yeah, I know. But then you have to make sure that your IDE is using the correct environment, that the notebook is using the correct environment, that the debugger is using the correct environment.

It's trivial to set up a venv, but sometimes it's just not worth it for me.
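For what it's worth, a quick way to see which environment an IDE, notebook or debugger is actually using is to ask the interpreter itself. A minimal sketch, assuming environments created with `python -m venv` (older `virtualenv` releases signalled this differently); the helper names are made up for illustration:

```python
import sys


def in_virtualenv() -> bool:
    """True when this interpreter runs inside a venv (PEP 405 style environment)."""
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """One-line summary of which Python is actually executing this code."""
    kind = "virtual env" if in_virtualenv() else "base/system install"
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"Python {version} ({kind}) at {sys.executable}, prefix={sys.prefix}"


if __name__ == "__main__":
    print(describe_interpreter())
```

Running the same file from the IDE's run button, a notebook cell and a plain terminal shows whether all three agree on the interpreter.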


This is one of the main reasons I tell people not to use VSCode. The people most likely to use it are juniors and people new to python specifically, and they're the most likely to fall victim to 'but my "IDE" says it's running 3.8 with everything installed, but when I run it from my terminal it's a different python 3.8'

I watched it last week, with 4 (I hope junior) devs in a "pair programming" session that forced me to figure out how VSCode does virtual envs, and still I had to tell them like 3 times "stop opening a damn new terminal, it's obviously not set up with our python version, run the command inside the one that has the virtual env activated".


Weird, in my experience vscode makes it very clear by making you explicitly choose a .venv when running or debugging.

When it comes to opening a new terminal, you would have the exact same problem by... running commands in a terminal; I can't see how that is VSCode-related.


> This is not a story of whether or not to use Kubernetes for production workloads - that’s a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.

> This is the story of how (not) to build development environments in the cloud.

I'd like to request that the comment thread not turn into a bunch of generic k8s complaints. This is a legitimately interesting article about complicated engineering trade-offs faced by an organization with a very unique workload. Let's talk about that instead of talking about the title!


Agreed. It's actually a very interesting use case and I can easily see that K8s wouldn't be the answer. My dev env is very definitely my "pet", thank you very much!


It'd be nice to editorialize the title a bit with "... (for dev envs)" for clarity.

Super useful negative example, and the lengths they pursued to make it fit! And no knock on the initial choice or impressive engineering, as many of the k8s problems they hit likely weren't understood gaps at the time they chose k8s.

Which makes sense, given k8s roots in (a) not being a security isolation tool & (b) targeting up-front configurability over runtime flexibility.

Neither of which mesh well with the co-hosted dev environment use case.


Can someone clarify if they mean development environments, or if they're talking about a service that they sell that's related to development environments.

Because I don't understand most of the article if it's the former. How are things like performance a concern for internal development environments? And why are so many things stateful - ideally there should be some kind of configuration/secret management solution so that deployments are consistent.

If it's the latter, then this is incredibly niche and maybe interesting, but unlikely to be applicable to anyone else.


4th paragraph in if you read the article…

> This is not a story of whether or not to use Kubernetes for production workloads - that’s a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.

> This is the story of how (not) to build development environments in the cloud.


I'm not sure that this really answers their question.


> Can someone clarify if they mean development environments, or if they're talking about a service that they sell that's related to development environments.

their question isn't asking anything. It's both about development environments AND a service they sell, which is dev envs.

Even so

> This is the story of how (not) to build development environments in the cloud.

that is what the article is about. Their words. The person asked what the article is about, this is it, from the authors themselves.

read the damn article


It's for running their commercial products, which are stateful and long-lived developer environments.


The article does a great job of explaining the challenges they ran into with Kubernetes, and some of the things they tried... but I feel like it drops the ball at the end by not telling us at least a little what they chose instead. The article mentions they call their new solution "Gitpod Flex" but there is nothing about what Gitpod Flex is. They said they tried microVMs and decided against them, and of course Kubernetes, the focus of the article. So is Gitpod Flex based on full VMs? Docker? Some other container runtime??

Perhaps a followup article will go into detail about their replacement.


Yeah, that's fair. The blog was getting quite long, so we need to do some deeper dives in follow-ups.

Gitpod Flex is runner-based. The runner interface is intentionally generic so that we can support different clouds, on-prem or just Linux in future.

The first implemented runner is built around AWS primitives like EC2, EBS and ECS. But because of the more generic interface Gitpod now supports local / desktop environments on macOS. And again, future OS support will come.

There’s a bit more information in the docs, but we will do some follow ups!

- https://www.gitpod.io/docs/flex/runners/aws/setup-aws-runner...
- https://www.gitpod.io/docs/flex/gitpod-desktop

(I work at Gitpod)


Echoing the parent you're replying to. You built up all of the context and missed the payoff.


I thought it was fair.

>> We’ll be posting a lot more about Gitpod Flex architecture in the coming weeks or months.

Cramming more detail into this post would have exceeded the average user read time ceiling.


Still no idea what you did technically... Maybe a second post?

Did you use consul?


that is exactly what a "follow-up" is


Awesome, looking forward to hearing more. I only recently began testing out Theia and OpenVSCodeServer, I really appreciate Gitpod's contributions to open source!


What’s a “runner”?


It's a compute resource you configure to offload compute jobs from a specific platform. You can have for instance Jenkins runners that will actually execute the pipelines and leave the main node free to do UI and admin tasks.

You also have GitHub and GitLab, which have their own hosted runners for pipelines, but also enable you to configure a runner that uses private resources to offload jobs to.


Sounds more to me like they need a new CTO.

And that they're desperate to tell customers that they've fixed their problems.

Kubernetes is absolutely the wrong tool for this use case, and I argue that this should be obvious to someone in a CTO-level position, or their immediate advisors.

Kubernetes excels as a microservices platform, running reasonably trustworthy workloads. The key features of Kubernetes are rollout (highly available upgrades), elasticity (horizontal scaleout), bin packing (resource limits), CSI (dynamically mounted block storage), and so on. All this relates to a highly dynamic environment.

This is not at all what Gitpod needs. They need high performance disks, ballooning memory, live migrations, and isolated workloads.

Kubernetes does not provide you sufficient security boundaries for untrusted workloads. You need virtualization for that, and ideally physically separate machines.

Another major mistake they made was trying to build this on public cloud infrastructure. Of course the performance will be ridiculous.

However, one major reason for using Kubernetes is sharing the GPU. That is, to my knowledge, not possible with virtualization. But again, do you want to risk sharing your data, on a shared GPU?


I consider Kubernetes to be an excellent framework to build these kinds of applications. The difference here is Gitpod being stateful, which is notoriously hard on Kubernetes, though easier now than ever before!

To clarify on one of your points, Kubernetes itself has nothing to do with actually setting the security boundaries. It only provides a schema to describe resources and policies, and then an underlying system (perhaps Cilium for networking, or Kata Containers for micro VMs) can ensure that the resources created actually follow those schemas and policies.

For example, Neon have built https://github.com/neondatabase/autoscaling which manages Neon Instances with Kubernetes by running them with QEMU instead. This allows them to do live migrations and resource (de)allocation while the service is running, without having to replace Kubernetes. These workloads are, as far as I understand it, stateless.


> The difference here is Gitpod being stateful, which is notoriously hard on Kubernetes, though easier now than ever before!

We've always had issues with stateful kubernetes setups. Can you share what makes it easier today than before? Genuinely interested.


You make an excellent point, and it emphasizes the need to distinguish between a typical Kubernetes setup (containers, pod/service mesh, and so on), and what Kubernetes can do in the abstract. In the extreme, the API server is just an HTTP interface for a KV store with a bit of RBAC and validation-mutation extensions.
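As a toy illustration of that "KV store with a bit of RBAC and validation-mutation extensions" view, here is a minimal sketch in Python. Nothing in it is a real Kubernetes API: `ToyApiServer`, the `read`/`write` verbs and the two hooks are hypothetical names chosen for the example, and a plain in-memory dict stands in for the store.

```python
from typing import Callable

Obj = dict
Mutator = Callable[[Obj], Obj]     # returns a (possibly modified) object
Validator = Callable[[Obj], None]  # raises ValueError to reject an object


class ToyApiServer:
    """In-memory KV store guarded by a verb-based RBAC table and admission hooks."""

    def __init__(self, rbac: dict[str, set[str]]):
        self.store: dict[str, Obj] = {}
        self.rbac = rbac  # user -> set of allowed verbs
        self.mutators: list[Mutator] = []
        self.validators: list[Validator] = []

    def _authorize(self, user: str, verb: str) -> None:
        if verb not in self.rbac.get(user, set()):
            raise PermissionError(f"{user} may not {verb}")

    def put(self, user: str, key: str, obj: Obj) -> Obj:
        self._authorize(user, "write")
        obj = dict(obj)  # work on a copy so callers keep their original
        for mutate in self.mutators:    # mutation runs first...
            obj = mutate(obj)
        for validate in self.validators:  # ...then validation sees the final object
            validate(obj)
        self.store[key] = obj
        return obj

    def get(self, user: str, key: str) -> Obj:
        self._authorize(user, "read")
        return self.store[key]


def default_replicas(obj: Obj) -> Obj:
    """Hypothetical mutating hook: fill in a default field."""
    obj.setdefault("replicas", 1)
    return obj


def require_name(obj: Obj) -> None:
    """Hypothetical validating hook: reject objects without a name."""
    if "name" not in obj:
        raise ValueError("name is required")


if __name__ == "__main__":
    api = ToyApiServer({"alice": {"read", "write"}, "bob": {"read"}})
    api.mutators.append(default_replicas)
    api.validators.append(require_name)
    api.put("alice", "apps/web", {"name": "web"})
    print(api.get("bob", "apps/web"))
```

Everything that makes a "typical" cluster (containers, networking, scheduling) would live in controllers watching this store, which is why swapping containers for VMs does not require replacing the API layer.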

What Neon is doing is quite a feat: Live migration (of a VM) while preserving TCP connections. It also took a lot of customization to achieve that.

But I agree that Kubernetes can indeed be used this way.

If anything, it further cements my original point about the Gitpod leadership.

The problem was never Kubernetes, but the dimwitted notion of using containers.

And then blaming Kubernetes for it: We're leaving you.


I agree on the cloud thing. Don't agree that "high performance disks, ballooning memory, live migrations, and isolated workloads" preclude using k8s - you can still run it as a base layer. You get some central configuration storage, machine management and some other niceties for free and you can push your VM-specific features into your application pod. In fact, that's how Google Cloud is designed (except they use Borg not k8s but same idea).


True! I love the idea of using K8s to orchestrate the running of VMs. With graceful shutdown and distributed storage, it makes it even more trivial to semi-live migrate VMs.

Are you aware of the limits? It must run as root and privileged?


In this scenario k8s is orchestrating the hypervisor, not VMs themselves. Hypervisor then orchestrates VMs + network (eg OVS) + other supporting functions (logs shipping, etc) on each individual “worker” node. VM scheduling/migration component needs to be completely decoupled from k8s apiserver (but itself can still run as normal k8s deployment) bc scaling kube api with unbound users is challenging. And yes, hypervisor will need to run privileged but you can limit it to worker nodes only


Why would you say that performance is bad on public cloud infrastructure?


There are things that public cloud is great for. Cost efficiency at high performance is not it. For Gitpod, performance is critical to their product offering, because any latency in a dev environment is terrible UX.

Example: What performance do you get out of your NVMe disks? Because these days you can build storage that delivers 100-200 GB/s.

https://www.graidtech.com/wp-content/uploads/2023/04/Results...

I bet few public cloud customers are seeing that kind of performance.


This is also my personal experience. I am finding that building out our own high-performance cluster of (second-hand) servers is orders of magnitude cheaper than having the same on GCP, even though we have to maintain/configure everything ourselves.


Kubernetes works great for stateless workloads.

For anything stateful, monolithic, or that doesn't require autoscaling, I find LXC more appropriate:

- it can be clusterized (LXD/Incus), like K8S but unlike Compose

- it exposes some tooling to the data plane, especially a load balancer, like K8S

- it offers system instances with a complete distribution and an init system, like a VM but unlike a Docker container

- it can orchestrate both VMs (including Windows VMs) and LXC containers at the same time in the same cluster

- LXC containers have the same performance as Docker containers unlike a VM

- it uses a declarative syntax

- it can be used as a foundation layer for anything stateful or stateless, including the Kubernetes cluster

LXD/Incus sits somewhere between Docker Swarm and a vCenter cluster, which makes it one of the most versatile platforms. Nomad is also a nice contender: it cannot orchestrate LXC containers but can autoscale a variety of workloads, including Java apps and qemu VMs.


I too am rallying quickly to the Incus way of doing things. Also of note, there's an effort to build a utility to write Compose manifests for Incus workloads that I'm following very closely. https://github.com/bketelsen/incus-compose


Thanks for pointing out `incus-compose`!



I do agree with the points in article that k8s is not a good fit for development environments.

In my opinion, k8s is great for stable and consistent deployment/orchestration of applications. Dev environments by default are in a constant state of flux.

I don’t understand the need for “cloud development environments” though. Isn’t the point of containerized apps to avoid the need for synchronizing dev envs amongst teams?

Or maybe this product is supposed to decrease onboarding friction?


It's to ensure a consistent environment for all developers, with the resources required. E.g. they mention GPUs, for developers working with GPU-intensive workloads. You can ship all developers gaming laptops with 64GB RAM and proper GPUs, and have them fight the environment to get the same libraries as you have in prod (even with containers that's not trivial), or you can ship them Macbook Airs and similar, and have them run consistent (the same) dev environments remotely (you can self-host gitpod, it's not only a cloud service, it's more the API/environment to get consistent remote dev environments).


Yeah, exactly. Containers locally are a basic foundation. But usually those containers or services need to talk to one another, they need some form of auth and credentials, they need some networking setup. There's a lot of configuration in all of that. The more devs swap projects or the more complex the thing you're working on the more the challenge grows. Automating dependencies, secret access, ensuring projects have the right memory, cpu, gpu etc. Also security - moving source code off your laptop and devices and standardizing your setups helps if you need to do a lot of audit and compliance as you can automate it.


In my experience, the case where this becomes really valuable is if your team needs access to either different kinds of hardware or really expensive hardware that changes relatively quickly (i.e. GPUs). At a previous small startup I set up https://devpod.sh/ (similar to gitpod) for our MLE/Data team. It was a big pro to leverage our existing k8s setup w/ little configuration needed to get these developer envs up and running as-needed, and we could piggyback off of our existing cost tracking tooling to measure usage, but I do feel like we already had infra conducive to running dev envs on k8s before making this decision -- we had cost tracking tooling, we had a dedicated k8s cluster for tooling, we had already been supporting GPU based workloads in k8s, and our platform team that managed all the k8s infra also were the SMEs for anything devenv related. In a world where we started fresh and absolutely needed ephemeral devenvs, I think the native devcontainer functionality in vscode or something like github codespaces would have been our go to, but even then I'd push for a docker-compose based workflow prior to touching any of these other tools.

The rest of our eng team just did dev on their laptops though. I do think there was a level of batteries-included-ness that came with the ephemeral dev envs which our less technical data scientists appreciated, but the rest of our developers did not. Just my 2c


Sarcastically, CDE is one way to move cost from CAPEX (get your developer a MacBook Pro) to OPEX (a monthly subscription that you only need to pay as long as the dev has not been laid off).

It's also much cheaper to hire contractors and give them a CDE that can be terminated at a moment's notice.


  >Kubernetes seems like the obvious choice for building out remote, standardized and automated development environments
- Is it really Obvious Choice™ though Fred?

- Hmm, let's consult the graphs.

  >Kubernetes is a container orchestration system for automating software deployment.
- It's about automating deployment Carl, not development environments!

  >Kubernetes is not the right choice for building development environments, as we’ve found.


The original k8s paper mentioned that the only use case was a combination of low latency and high latency workflows, and the resource allocation is based on that. The generic idea is that you can easily move low latency work between nodes and there are no serious repercussions when a high latency job fails.

Based on this information, it is hard to justify to even consider k8s for the problem that gitpod has.


Thanks for reading the paper!


For those who are interested:

https://static.googleusercontent.com/media/research.google.c...

I am not sure what differences k8s has compared to Borg. At the concept level these are pretty comparable.


I've worked on something similar to gitpod in a slightly different context that's part of a much bigger personal project related to secure remote access that I've actually spent a few years building now and hope to open source in a few months from now. While I agree on many of the points in the article, I just don't understand how using micro VMs by itself replaces K8s unless they actually start building their own K8s that orchestrates their micro VMs (as opposed to containers in the case of k8s) ending up with the same thing basically when k8s itself can be used to orchestrate the outer containers that run the micro VMs used to run the dev containers. Yes, k8s has many challenges when it comes to nesting containers, cgroups, creating rootless containers inside the outer k8s containers and other stuff such as multi-region scaling, but actually the biggest challenge that I've faced so far isn't related to networkPolicies or cgroups but is actually by far related to storage, both when it comes to (lazily) pulling big OCI images which are extremely unready to be used for dev containers whose sizes are typically in the GBs or 10s of GBs as well as also when it comes to storage virtualization over the underlying k8s node storage. There are serious attempts to accelerate image pulling (e.g. Nydus) but such solutions would still probably be needed whether you use micro VMs or rootless/userns containers in order to load and run your dev containers.


I feel like anyone who was building a CI solution to sell to others and chose kubernetes didn't really understand the problem.

You're running hot pods for crypto miners and against people who really want to see the rest of the code that box has ever seen. You should be isolating with something purpose built like firecracker, and do your own dispatch & shred for security.


Firecracker is more comparable to container runtimes than to orchestrators such as K8s. You still need an orchestrator to schedule, manage and garbage-collect all your uVMs on top of your infrastructure exactly like you would do with containers via k8s. In other words, you will probably have to either use k8s or build your own k8s to run "supervisor" containers/processes that launch uVMs which in turn launch the customer dev containers.
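To make the "schedule, manage and garbage-collect" point concrete, here is a minimal sketch of what such a supervisor has to do even in its simplest form. Everything in it (`Node`, `ToyOrchestrator`, first-fit placement by memory) is a hypothetical illustration, not Firecracker's or Kubernetes' API, and a real system would also need health checks, networking and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A host that can run micro VMs (hypothetical model)."""
    name: str
    mem_mb: int                                # total memory on the host
    vms: dict = field(default_factory=dict)    # vm_id -> reserved MB

    def free_mb(self) -> int:
        return self.mem_mb - sum(self.vms.values())


class ToyOrchestrator:
    """Tracks placements only; it does not actually launch anything."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.stopped = set()

    def schedule(self, vm_id: str, mem_mb: int) -> str:
        # First-fit placement: take the first node with enough free memory.
        for node in self.nodes:
            if node.free_mb() >= mem_mb:
                node.vms[vm_id] = mem_mb
                return node.name
        raise RuntimeError(f"no capacity for {vm_id}")

    def mark_stopped(self, vm_id: str) -> None:
        self.stopped.add(vm_id)

    def garbage_collect(self) -> int:
        # Release reservations held by stopped VMs; return how many were freed.
        freed = 0
        for node in self.nodes:
            for vm_id in list(node.vms):
                if vm_id in self.stopped:
                    del node.vms[vm_id]
                    freed += 1
        self.stopped.clear()
        return freed


orch = ToyOrchestrator([Node("a", 4096), Node("b", 4096)])
print(orch.schedule("vm1", 3000))  # placed on the first node with room
```

Even this toy version already has the shape of a scheduler plus a reconciliation loop, which is the part k8s gives you for free.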


For sure, but that's the point - containers aren't really good for an adversarial CI solution. You can run that shit in house on kubernetes on a VM in a simulated VR if you want. But if you have adversarial builds, you have a) builds that may well need close to root, and b) customers who may well want to break your shit. Containers are not the right solution for that, VM's get you mostly there, and the right answer is burning bare metal instances with fire after every change-of-tenant - but nobody does that (anymore), because VM's are close enough and it's faster to zero out a virtual disk than a real one.

So if you started with kubernetes and fought the whole process of why it's not a great solution to the problem, I have to assume you didn't understand the problem. I :heart: kubernetes, its complexity pays my bills - but it's barely a good CI solution when you trust everyone involved, it's definitely not a good one where you're trying to be general-purpose to everyone with a makefile.


I would argue that dev containers are more complicated than CI even though they share many of the challenges (e.g. devcontainers might need to load 10s or 100s of GBs to start and are write heavy). I would also argue that userns/rootless containers provide "enough" isolation when it comes to isolating CPU/memory/networking as well as access to the host's syscalls if you're careful enough; however when it comes to storage (e.g. max disk size that a container can use and write to, max opened files, completely hiding the host's fs from the container's, etc...), it's unfortunately still extremely limited and fs-dependent for some features, even though modern solutions (e.g. vDPA and ublk) can be used to fix that and virtualize the storage for containers.


You can run your pods in VMs, with something like Kata Containers. Kubernetes is more a scheduler than an isolation layer. Of course it uses the cri-o runtime for containers by default and relies heavily on cgroups, but that is just the default.


I tried doing a dev environment on Kubernetes but the fact you have to be dealing with a set of containers that could change if the base layer changed meant instability in certain cases which threw me off.

I ended up with a mix of nix and its vm build system, which is based on qemu. The issue is that it is too tied to NixOS, and all services run in the same place, which forces you to manage ports and other things.

How I wish it could work is having a flake that defines certain services; these services could or could not run in different µVMs sharing an isolated linux network layer. Your flake could define your versions and your commands to interact with and manage the lifecycle of those µVMs. As the nix store can be cached/shared, it can provide fast and reproducible builds after the first build.


Have you tried https://github.com/astro/microvm.nix ? You can use the same NixOS module for both declarative VMs and imperatively configured and spawned VMs.


> the fact you have to be dealing with a set of containers that could change if the base layer changed meant instability

Can you expand on this? Are you talking about containers you create?


We've been using Nix flakes and direnv (https://direnv.net/) for developer environments and NixOS with https://github.com/serokell/deploy-rs for prod/deploys - takes serious digging and time to set up, but excellent experience with it so far.


I’ve been using Nix for the past year and it really feels like the holy grail for stable development environments. Like you said—it takes serious time to set up, but it seems like that’s an unavoidable reality of easily sharable dev envs.


Serious time to set up _and_ maintain as the project changes. At least, that was my experience. I really _want_ to have Nix-powered development environments, but I do _not_ want to spend the rest of my career maintaining them because developers refuse to "seriously dig" to understand how it works and why it decided to randomly break when they added a new dependency.

I think this approach works best in small teams where everyone agrees to drink the Nix juice. Otherwise, it's caused nothing but strife in my company.


This may be the one area where some form of autocracy has merit :-)


What's that saying about developers and herding cats? :)


Phew, it is absolutely true. Building dev environments on k8s becomes wasteful. To add to this complexity, if you are building a product that is self-hosted on the customer's infrastructure, debugging and support also become non-homogeneous and difficult.

What we have seen work, especially when you are building a developer-centric product, is to expose these native issues around network, memory, compute and storage to engineers; they are more willing to work around them. Abstracting those issues away shifts the responsibility onto the product.

Having said that, I still think k8s is an upgrade when you have a large team.


Kubernetes is just a combination of infra admin practices. Whether we use it or not, we need to do the same things in a local-oriented way or a vendor-specific way.

1. Some remote operations done the local-oriented way are time consuming and unmanageable.

2. With the vendor-specific way, our skills become obsolete and we depend on the vendors.

3. Kubernetes is not the best tool, but it is popular.

As always, a custom solution is the most powerful, but it should be replaced with a more unified approach for the stability of development.


Our first implementation of brev.dev was built on top of kubernetes. We were also building a remote dev environment tool at the time. Treating dev environments like cattle seemed to be the wrong assumption. Turning kubernetes into a pet manager was a huge endeavor with long tail of issues. We rewrote our platform against vms and were immediately able to provide a better experience. Lots of tradeoffs but makes sense for dev envs.


The problem with "development environments", like other interactive workloads, is that there is a human at the other end that desires a good interactive experience with every keypress. It's a radically different problem space than what k8s was designed for.

From a resource provider perspective, the only way to squeeze a margin out of that space would be to reverse engineer 100% of human developer behavior so that you can ~perfectly predict "slack" in the system that could be reallocated to other users. Otherwise it's just a worse DX, like TFA gives examples of. Not a business I'm envious to be in... Just give everyone a dedicated VM or desktop, and make sure there's a batch system for big workloads.


Kubernetes is awesome but I understand what the article is getting at. K8s was designed for a mostly homogeneous architecture when your platform requirements end with "deploy this service to my cluster" and you don't really care about the specifics of how it's scheduled.

A heterogeneous architecture with multi-tenancy poses some unique challenges because, as mentioned in the article, you get highly inconsistent usage patterns across different services. Also, arbitrary code execution (with sandboxing) can present a significant challenge. For security, you ideally need full isolation between services which belong to different users; this isolation wasn't a primary design goal of Kubernetes.

That said, you can probably still use K8s, but in a different way. For smaller customers, you could co-locate on the same cluster, but for larger customers which have high scalability requirements, you could have a separate K8s cluster for each one. Surely for such customers, it's worth the extra effort.

So in conclusion, I don't think the problems which were identified necessarily warrant abandoning K8s entirely, but maybe just a rethinking of how K8s is used. K8s still provides a lot of value in treating a whole cluster of computers as a single machine, especially if all your architecture is already set up for it. In addition to scheduling/orchestration, K8s offers a lot of very nice-to-have features like performance monitoring, dashboards, aggregated logs, ingress, health checks, ...


The real reason for this shift is that kubernetes moved to containerd, which they cannot handle. Docker was much easier. Blaming differing workloads is not correct.

Also, there is a long tail of issues to be fixed if you do it with Kubernetes.

Kubernetes does not just give you scaling, it gives you many things: run on any architecture, be close to your deployment etc.



Most of the kubernetes providers (GKE, EKS) do not support this new shim. Even on baremetal it is possibly hard to run.


The article offers toward the end that now self-hosted customers can run their app on something other than k8s. I think this is a mistake. We're a k8s enterprise shop, and I don't want to support any more VMs. If it's not on k8s, I'm not running it. I don't want to be responsible for golden images, patching, and all the fun that comes with managing workloads outside of k8s. That's why I have k8s.

All the problems in the article also seem self-imposed. k8s can run stateful workloads just fine. Don't start and stop them. Figure out the math on how much it costs to run a container 24/7, add your margin, and pass that cost to the customer. Customer can decide to stop the containers to save $$, so the latency won't hurt, they'll accept it because they know they're saving money.
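The "figure out the math" step can be sketched in a few lines. The hourly cost and margin below are made-up numbers for illustration only, and `monthly_price` / `price_with_stops` are hypothetical helper names, not anything from the article.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_price(hourly_cost: float, margin: float) -> float:
    """Price of one container left running 24/7 for a month, cost plus margin."""
    return hourly_cost * HOURS_PER_MONTH * (1 + margin)


def price_with_stops(hourly_cost: float, margin: float, hours_running: float) -> float:
    """Price when the customer stops the container and only pays for running hours."""
    return hourly_cost * hours_running * (1 + margin)


# Assumed inputs: $0.10/hour infrastructure cost, 30% margin.
always_on = monthly_price(0.10, 0.30)
working_hours_only = price_with_stops(0.10, 0.30, hours_running=160)
print(round(always_on, 2), round(working_hours_only, 2))
```

Showing both numbers to the customer is what lets them choose between paying for always-on and accepting startup latency.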


I was intrigued because the development environment problem is similar to the data scientist one - data gravity, GPU sharing, etc - but I'm confused on the solution?

Oddly, I left with a funny alternate takeaway: One by one, their clever inhouse tweaks & scheduling preferences were recognized by the community and turned into standard k8s knobs

So I'm back to the original question... What is fundamentally left? It sounds like one part is maintaining a clean container path to simplify a local deploy, which a lot of k8s teams do (ex: most of our enterprise customers prefer our docker compose & AMIs over k8s). But more importantly, something fundamental architecturally about how envs run that k8s cannot do, but they do not identify?


OP here. The Kubernetes community has been fantastic at evolving the platform, and we've greatly enjoyed being in the middle of it. Indeed, many of the things we had to build next to Kubernetes have now become part of k8s itself.

Still, some of the core challenges remain:

- the flexibility Kubernetes affords makes it hard to build and distribute a product with such specific requirements across the broad swath of differently set up Kubernetes installations. Managed Kubernetes services help, but come with their own restrictions (e.g. Kernel versions on GKE).

- state handling and storage remains unsolved. PVCs are not reliable enough, subject to a lot of variance (see point above), and depending on the backing storage have vastly different behaviour. Local disks (which we use to this day), make workspace startup and backup expensive from a resource perspective and hard to predict timing wise.

- user namespaces have come a long way in Kubernetes, but by themselves are not enough. /proc is still masked, FUSE is still not usable.

- startup times, specifically container pulls and backup restoration, are hard to optimize because they depend on a lot of factors outside of our control (image homogeneity, cluster configuration)

Fundamentally, Kubernetes simply isn't the right choice here. It's possible to make it work, but at some point the ROI of running on Kubernetes simply isn't there.


Thanks!

AFAICT, a lot of that comes down to storage abstractions, which I'll be curious to see the answer on! Pinned localstorage <> cloud native is frustrating.

I sense another big chunk is the fast secure start problem that firecracker (noted in the blogpost) solves but k8s is not currently equipped for. Our team has been puzzling over that one for a while, and part of our guess is incentives. It's been 5+ years since firecracker came out, so that has likewise been frustrating to see.


> We’ll be posting a lot more about Gitpod Flex architecture in the coming weeks or months. I’d love to invite you on November the 6th to a virtual event where I’ll be giving a demo of Gitpod Flex and I’ll deep-dive into the architecture and security model at length.

Bottom of the post.


For development, I made the switch to nix/flox and it’s been a game-changer.


How well does Flox work out of the box? I would really like to introduce Nix to the dev environments in my company but the struggle of maintaining nix files and flakes is too large. I've looked at DevBox and it looks quite accessible but Flox also looks like a nice way to sneak some of the Nix goodness into the company.


It works pretty well for most common packages; if a rare package/dependency fails, it is mostly a nix problem to solve upstream.


Regardless of how you edit/compile your code, you still need to debug/troubleshoot problems in production, and that is very likely to use Kubernetes. So the more reasonable approach seems to be: first, figure out how you troubleshoot/identify/mitigate a problem in production, then reproduce it in a development environment and work to fix it during the day. When you have instrumented your app for a reasonable debugging experience, using these tools on a development machine becomes a much easier problem, K8s or not.


The article is an excellent cautionary tale. Debugging an app in a container is one thing. Debugging an app running inside a Kubernetes node is a rabbit hole that demands more hours and expertise.


The debate in the comments about whether you should run locally is fascinating.

To the people saying ultra modern hardware could handle it: worth remembering the companies in question started on this path X years ago with Y set of technologies and Z set of experiences.

Because it made sense for Google in 2012 or whatever doesn't necessarily mean they would choose it again --or not-- given a do over (but there's basically no way back).


Make sure you need a microservices-based architecture, because it comes with its own complexity - a load balancer, container networking, distributed tracing, etc. If your application does not need to scale its sub-components independently, you are better off using a VM-based application. It's 10X cheaper to maintain and troubleshoot, and it gets more performance out of the same resources.


I was wondering if there's a productivity angle too. Take Ceph vs Rook for example. If a Ceph cluster needs all the resources on its machines and the cluster manages its resources too, then moving to Rook does not give any additional features. All the 50K additional lines of code in Rook are there to set up CSIs and statefulsets and whatnot just to get Ceph working on Kubernetes.


I can completely relate to anyone abandoning K8s. I'm working with dstack, an open-source alternative to K8s for AI infra [1]. We talk to many people who are frustrated with K8s, especially for GPU and AI workloads.

[1] https://github.com/dstackai/dstack


I really like dstack, keep up the great work


> SSD RAID 0

> A simpler version of this setup is to use a single SSD attached to the node. This approach provides lower IOPS and bandwidth, and still binds the data to individual nodes.

Are you sure SSD is that slow? NVMe devices are so fast that I hardly believe there's any need for RAID 0.


In AWS iirc NVMe max out at 2GB/s - I'm not sure why that's the case. I know there were issues with the PCIe controller in the past being the bottleneck, but I suspect there's something more to it than that.


> Autoscaler plugins: In June 2022, we switched to using cluster-autoscaler plugins when they were introduced.

Does anyone have any links for cluster-autoscaler plugins? Searching draws a blank, even in the cluster-autoscaler repo itself. Did this concept get ditched/removed?


> development environments

Kubernetes has never ever struck me as a good idea for a development environment. I'm surprised it took the author this long to figure out.

K8s can be a lifesaver for production, staging, testing, ... depending on your requirements and infrastructure.


Our operations team is planning to build dev envs in k8s, but only the networked dependencies. Like a personal testing/staging where you have full control of the thing(s) you are developing and can simply leverage the rest of the stack.

Sounds sane. Am I missing anything?


> Kubernetes is immensely challenging as a development environment platform

Glad someone said it out loud. So true. Apptainer has been a far better development experience for us.


I also recently left Kubernetes. It was a huge waste of time and money. I've replaced it with just a series of services on Google Cloud Run and then using Google's Cloud Run Tasks services for longer running tasks.

The infrastructure is now incredibly understandable, simple and cost effective.

Kubernetes cost us >$million in both DevOps time and actual Google Cloud costs unnecessarily, and even worse it cost us time to market. Stay off of Kubernetes as long as you can in your company, unless you are basically forced onto it. You should view it as an unnecessary evil that comes with massive downsides in terms of complexity and cost.


As far as I can tell, there actually is no AWS equivalent to GCP Cloud Run. The closest equivalents I know of are ECS on Fargate, which is more like managed Kubernetes except without Kubernetes compatibility or modern features, or AppRunner, which is closer in concept but also sorely lacking in comparable features.


Wow, very interesting. I think we could discuss it for hours.

1.) What do you think of things like hetzner / linode / digitalocean (if stable work exists)?

2.) What do you think of https://sst.dev/ or https://encore.dev/ ? (They support rather easier migration)

3.) Could you please indicate the split of that $1 million between devops time and unnecessary Google Cloud costs? Were there some outliers (like "oh, our intern didn't add this specific variable, which misconfigured the cloud and wasted 10k on gcloud, oops!"), or was it that bandwidth costs that much more in gcloud (I don't think the latter is the case though)?

Looking forward to chatting with you!


Aren't you afraid of being now stuck with GCP?


It is just a bunch of docker containers. Some run in tasks and some run as auto-scaling services. Would probably take a week to switch to AWS as there are equivalent managed services there.

But this is really a spurious concern. I myself used to care about it years ago. But in practice, rarely do people switch between cloud providers because the incremental benefits are minor, they are nearly equivalent, there is nothing much to be gained by moving from one to the other unless politics are involved (e.g. someone high up wants a specific provider.)


How does the orchestration work? How do you share storage? How do the docker containers know how to find each other? How does security work?

I feel like Kubernetes' downfall, for me, is the number of "enterprise" features it (got convinced into) supporting and enterprise features doing what they do best: turning the simplest of operations into a disaster.


> How does the orchestration work?

Github Actions CI. Take this and make a few more dependencies and a matrix strategy and you are good to go: https://github.com/bhouston/template-typescript-monorepo/blo... For dev environments, you can add post-fixes to the services based on branches.

> How do you share storage?

I use managed DBs and Cloud Storage for shared storage. I think that provisioning your own SSDs/HDs to the cloud is indicative of an anti-pattern in your architecture.

> How do the docker containers know how to find each other?

I try to avoid too much communication between services directly, rather try to go through pub-sub or similar. But you can set up each service with a domain name and access them that way. With https://web3dsurvey.com, I have an api on https://api.web3dsurvey.com and then a review environment (connected to the main branch) with https://preview.web3dsurvey.com / https://api.preview.web3dsurvey.com.

> How does security work?

You can configure Cloud Run services to be internal only and not to accept outside connections. Otherwise one can just use JWT or whatever is normal on your routes in your web server.


> But you can set up each service with a domain name and access them that way.

Are you using Cloud Run domain mappings for this or something else?

I have been converging on a similar stack, but trying to avoid using a load balancer in an effort to keep fixed costs low.


Yup domain mappings for now. There is some label support in Cloud Run but I haven’t explored it yet. You can also get the automatic domain name for a service via the cloud run tools.

Yeah I definitely want to also avoid a load balancer or gateway or end points as well for cost purposes.


One of Cloud Run's main advantages is that it's literally just telling it how to run containers. You could run those same containers in OpenFaaS, Lambda, etc relatively easily.


What stack are you deploying?


Stuff like this, just at larger scale:

https://github.com/bhouston/template-typescript-monorepo

This is my living template of best practices.


I'd investigate getting a build out to Node.js (looks like you already have this) and then just doing a simple SCP of the build to a VPS. From there, just use a systemd script to handle startup/restart on errors. For logging, something like the Winston package does the trick.

If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.


> I'd investigate getting a build out to Node.js (looks like you already have this) and then just doing a simple SCP of the build to a VPS. From there, just use a systemd script to handle startup/restart on errors. For logging, something like the Winston package does the trick. If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.

I appreciate the offer! But it is not as robust and it is more expensive and misses a lot of benefits.

Back in the 1990s I did FTP my website to a VPS after I graduated from Geocities.

Google Cloud charges based on CPU used. Thus if my servers have no traffic, they cost less than $1/month. If they have traffic, they are still cost effective. https://web3dsurvey.com has about 500,000 hits per month and it costs me $4/month to run both the Remix web server and the Fastify API server. Details here: https://x.com/benhouston3d/status/1840811854911668641

Also it will autoscale under load. Thus when one of my posts was briefly the top story on Hacker News last month, Google Cloud Run added more instances to my server to handle the load (because I do not run my personal site behind a CDN, it cost too much, I prefer to pay $1/month for hosting.)

Also deploying Docker containers that build on Github Actions CI in a few minutes is a great automated experience.

I do also use Google services like Cloud Storage, Firestore, BigQuery etc. And it is easier to just run it on GCP infrastructure for speed.

I also have to version various tools that get installed in the docker like Blender, Chromium, etc. This is the perfect use case for Docker.

I feel this is pretty close to optimal. Fast, cheap, scalable, automated and robust.
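For scale, the figures quoted above (about 500,000 hits per month for $4/month) can be turned into a unit cost. This is only arithmetic on those two numbers; `cost_per_million_hits` is a hypothetical helper and says nothing about how Google actually bills.

```python
def cost_per_million_hits(monthly_cost_usd: float, monthly_hits: int) -> float:
    """Average cost in USD per one million hits, given a monthly bill and traffic."""
    return monthly_cost_usd * 1_000_000 / monthly_hits


# Numbers taken from the comment: $4/month for roughly 500,000 hits/month.
print(cost_per_million_hits(4, 500_000))  # 8.0
```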


To be really honest, why don't you use Cloudflare for blog post hosting, or their storage mechanism if you really want comment hosting?

Why are you actually using Google Cloud for blog post hosting?

Also, you said Kubernetes cost a million dollars.

Wait a second, have you converted those million dollars to $4 per month?

What tomfoolery is this?


I am demonstrating my point using simple personal projects that I can easily explain.

But I am also the Founder/CTO of Threekit.com.

I hope that makes sense now.


There was a recent HN post about a setup that didn't even use Docker but some other mechanism, and it was so simple. I really enjoyed that article.


Yeah, I have the same thoughts. Also, if possible, Bun can reduce memory usage in very basic scenarios: https://www.youtube.com/watch?v=yJmyYosyDDM

Or just https://github.com/mightymoud/sidekick or coolify or dokku or dockify; there are a million such things. Oh, I just remembered Kamal from DHH, and Docker Swarm IIRC (though people seem to have forgotten Docker Swarm!)

I like this idea very much!


You know that Cloud Run is effectively a Kubernetes PaaS, right?


Google employee here. Not the case. Cloud Run doesn't run on Kubernetes. It supports the Knative interface which is an OSS project for Kubernetes-based serverless. But Cloud Run is a fully managed service that sits directly atop Borg (https://cloud.google.com/run/docs/securing/security).


Parent said "effectively", which it appears you confirm.


But it skips out on the Kubernetes part, which is the important part. :)


I guess the point is that for the OP, Kubernetes is now someone else's problem.


> You know that Cloud Run is a Kubernetes PaaS, right?

Yup. Isn't it Knative Serving or a home grown Google alternative to it? https://knative.dev/docs/serving/

The key is I am not managing Kubernetes and I am not paying for it - it is a fool's errand, and incredibly rarely needed. Who cares what is underneath the simple Cloud Run developer UX? What matters for me is cost, simplicity, speed and understandability. You get that with Cloud Run, and you don't with Kubernetes.


I haven't looked at Cloud Run pricing but running Kubernetes in the cloud is pretty cheap these days and my experience with solutions like Cloud Run in the past is that they end up becoming expensive.

Kubernetes can be as complex or as expensive as you'd like but it's also fairly possible to run a pretty bulletproof simple Kube cluster.


Maybe we misconfigured Kubernetes?

Here are my concerns:

With Kubernetes, you need to pay for a few nodes just to keep it up, and then you need to pay for your nodes no matter how much you use them.

Remember that Cloud Run charges based on usage, so if a service sits unused for a while, which often happens in a heterogeneous microservices environment, you don't pay for it.

Also autoscaling is slow (Cloud Run autoscales really quickly, about as fast as your docker can be loaded and started, which for me is 1-3 seconds, whereas I found Kubernetes auto-scales on the order of minutes) unless you over-provision, which is costly. Cloud Run's fast start lets one scale to zero without much of a hit.

I also ran into massive issues trying to get GPUs to work in Kubernetes - it was a driver nightmare that has wasted weeks of time collectively over the years. Whereas they are auto-provisioned properly on Cloud Run if you request them.

Lastly job systems on Kubernetes are a nightmare of configuration. The built-in scheduler cannot handle a lot of jobs but Argo also has its own issues if you actually try to use it. We've wasted weeks of effort on this. Cloud Run Tasks just skips this and is ultra fast too and handles scaling up to do a lot of jobs in such a simple fashion.

Honestly, managing Kubernetes is just overall a pain that has little benefit.

It is really hard to figure out what the benefits of Kubernetes are from my point of view. It has been a massive source of pain and costs and lost developer time.
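The always-on versus pay-per-use point above can be made concrete with a break-even calculation. The node price, node count and per-hour usage price below are assumptions picked for illustration, not real GKE or Cloud Run prices, and the function names are hypothetical.

```python
def always_on_cost(node_monthly_cost: float, node_count: int) -> float:
    """Monthly bill for nodes that are paid for whether or not they are used."""
    return node_monthly_cost * node_count


def pay_per_use_cost(price_per_busy_hour: float, busy_hours: float) -> float:
    """Monthly bill when only hours of actual work are charged."""
    return price_per_busy_hour * busy_hours


def break_even_busy_hours(node_monthly_cost: float, node_count: int,
                          price_per_busy_hour: float) -> float:
    """Busy hours per month above which the always-on cluster becomes cheaper."""
    return always_on_cost(node_monthly_cost, node_count) / price_per_busy_hour


# Assumed: 3 nodes at $50/month each versus $0.25 per busy hour.
print(break_even_busy_hours(50, 3, 0.25))  # 600.0
```

Under these assumed prices, a set of services that is busy for fewer than 600 hours a month in total is cheaper on the usage-billed side; the real threshold depends entirely on actual pricing.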


Have folks seen success with https://earthly.dev/ as a tool in their dev cycle?


On a side note: has anybody experience with MicroK8s? I'd love to learn stories about it. I'm interested in both dev and production experiences.


Microk8s is nothing but a kubernetes distro from canonical. Personally I would use k3s because it’s a little more widespread and less opinionated in a good way.

Anyway, as always it depends on what you want to use it for.


The cloud maker is the answer of all this. qbo.io


Hi Alex Diaz from qbo, can you stop spamming links to your website all over the net? At least elaborate.


We started having a few developers have constant VSCode timeouts. We switched to GitHub devcontainers which have been great.


Damn those are really good features they could have contributed to Kubernetes.


I read this article and I still don't understand what's wrong with Kubernetes for this task. Everything you would do with virtual machines could be done with Kubernetes with very similar results.

I guess team just wants to rewrite everything, it happens. Manager should prevent that.


The cloud Maker is the answer to all this. qbo.io


You just simplified Kubernetes Management System


Leaving this comment here so I'll always come back to read this as someone who was considering kubernetes for a platform like gitpod


Remember that you can favorite posts.


Why don't they describe their new system? I feel disappointed. :(



