Running servers and services well is not trivial (2018) (utcc.utoronto.ca)
156 points by kiyanwang on Feb 17, 2020 | 74 comments



Neither is: any complex manufacturing technique, chemical or pharmaceutical synthesis, drilling for oil, building a mile of interstate highway, assembly of automobiles, etc., but I don't think someone with 2 years of experience could cause either significant process damage or significant process improvement in those complex and expensive activities.

I am therefore regularly shocked by how much money is poured into or otherwise spent on tech and tech workers and how laughable the result is. I've really never seen a server infrastructure at any company where I haven't shuddered in horror within a few hours of looking at the design, practices, or systems themselves. When I was much more junior I was perplexed by the sudden appeal of, e.g., AWS EC2, but then I saw from the inside how multinational conglomerates try to acquire and deploy servers (i.e. just the logistics that a non-technical person could handle, no software) and now I understand. The irony to me is, if you fail at logistics, you have near zero chance of doing anything else right, so cloud doesn't reduce your liability much unless it is fully managed like GApps or Salesforce. This should be a huge warning sign for a variety of stakeholders, but money's cheap and the livin's easy right now.


> I am therefore regularly shocked by how much money is poured into or otherwise spent on tech and tech workers and how laughable the result is. I've really never seen a server infrastructure at any company where I haven't shuddered in horror within a few hours of looking at the design, practices, or systems themselves.

Yet, all these companies remain in existence.

Maybe what you consider to be "laughable" is actually "good enough" when it comes down to dollars and cents?


Existence doesn't mean much in times of quantitative easing; pick your favorite punching-bag company in recent memory that is still extant. If you don't understand the economic situation backing all this it's hard to discuss your rebuttal further, so we need to checkpoint there before I invest any more time in responses.

The issue is largely orthogonal to my own skillset, because one only needs to view quality regressions in order to assert my point, not be the cause or hero of them (ain't philosophy generous?!). The times a system operator with 2 years of experience could go in and wreak havoc on a mainframe operation were not particularly common, and the standards and practices that ensured business continuity (e.g. air-gapped backups on tape, geographically dispersed sysplexes, etc.) were generally planned in right from the initial purchase order, with the vendors' help, once upon a time. In this decade people can and did launch massive financial exchanges on MongoDB when it had severe and well-known issues.


Good enough according to whom? Inevitably there's always some unpatched server somewhere on the network that someone forgot about and that lets hackers access data for millions of customers.


You find the server infrastructure at every company you’ve seen to be poor, so you must have something better. If you have something objectively better, then why aren’t you cashing in on this knowledge (and improving the service or product for the customers too) by improving the server infrastructure at one of these companies?


I would love to. See Dan Luu's big company pay statistics. Want to fund me to ignore that? I don't think anyone would (aka bubble, value creation detached from valuation). I'll just save what I can, try not to rock the boat too much for the people underwriting my salary in any direction good or bad, and hope to GTFO before complete burnout.


I feel this way every time somebody says this about mail. "Why not just install Postfix on a cheap VPS and run your own mail? It's easy!"

Running a mail server is easy. Running a reliable mail server with good deliverability is not. Not being spammed into the next millennium is hard. Yes, it's possible (I do it on some inexpensive OVH VPSs) but it takes time to come up with a solution that works without being too restrictive. Plus, spam is a moving target.


I've been running Debian stable (exim4, spamd) on my own little hardware as my own email server for years now, and I've never had any issues at all besides the initial config. Spam filtering works better than commercial providers like GMX or Posteo. I have unattended-upgrades installed and every few years I do an apt-get dist-upgrade. That's it, for years now.


I also used to manage exim4/spamd long ago. Part of the problem I remember is that your email (sent from your server) can be marked as spam in someone else's inbox (e.g. you are blacklisted without knowing it). So while you haven't had issues on your side, it's hard to prove that you didn't have issues outside your box. This was especially common with VPS datacentres, as their IPs routinely showed up on blacklists.
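If you want to spot this yourself, a minimal Python sketch of a DNSBL lookup looks roughly like the below (Spamhaus ZEN is used purely as an example list, the IP is a placeholder from a reserved test range, and real use should respect each list's usage policy):

    # Reverse the octets, query <reversed>.zen.spamhaus.org,
    # and treat any answer as "listed".
    import socket

    def is_listed(ip, dnsbl="zen.spamhaus.org"):
        query = ".".join(reversed(ip.split("."))) + "." + dnsbl
        try:
            socket.gethostbyname(query)  # any A record means the IP is listed
            return True
        except socket.gaierror:          # NXDOMAIN means it is not listed
            return False

    print(is_listed("203.0.113.7"))      # placeholder IP (TEST-NET-3)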


My mini server (own hardware) is colocated in a datacenter in the city where I live. I've never had any issue with missing emails on the receiving side, nor has my IP been on a blacklist.


It's not as hard as everyone makes it out to be. I ran one out of my house for 20 years. The biggest issue I had was webmail providers blacklisting the entire IP range you were in, especially Hotmail, but with a bit of communication you could get around that as well.


This is getting harder than it was even 5 years ago if you don't have decent mail volume.


I quit doing it because I have an AWS account for small business side stuff and adding webmail is like 5 bucks a month.

That, and there are so many nosy devices on the home LAN these days that I had to create 4 subnets with isolation so the nosy IoT crap can't poke around my personal stuff.


Why keep nosy IoT crap in the house? You can never isolate it reliably; these things may listen, watch, and send data over a wireless connection to the mothership, just like cell phones. If you don't need that crap, I would get rid of it.


This article is talking about the difference between theory and practice.

I admin a small, low traffic business site for a friend. In theory, running a small PHP site should be easy.

In practice, I cannot think of a week that has gone by where everything just "worked": from script kiddies and hackers, to web components suddenly disappearing, to YouTube audits, to random bugs that aren't reproducible.

Large providers will always have an advantage in that they can identify adversarial activity before it reaches my site, and then apply what they learn to every site they operate.


For an upcoming project we're using AWS Marketplace + CloudFormation in combination with the community edition of Chef Habitat to try to solve this problem. Once someone deploys the CloudFormation template, it sets up:

- Email (SES)

- SSL certificates + renewal (ACM + Let's Encrypt)

- Backup + Restore (RDS snapshots + AWS Backup for EFS)

- ALB (if desired)

- CDN (CloudFront)

- Firewall

- Auto scaling database (RDS pgsql serverless) with automatic pausing

- Auto scaling storage (EFS)

You can adjust capacity by just dialing up or down the ASG, and our Elixir app auto-clusters using the Habitat ring for service discovery. Packages are upgraded when new versions are pushed to our package repository. All binaries run in jailed process environments scoped to Habitat packages, with configuration management and supervision handled by Habitat.
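As a rough illustration of what "dialing up or down" amounts to (names and numbers below are placeholders, not our actual setup), it's a single call with boto3:

    # Hedged sketch: assumes a single app-tier ASG as described above;
    # the group name and region are hypothetical.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="app-tier-asg",  # placeholder name
        MinSize=2,
        MaxSize=10,
        DesiredCapacity=4,
    )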

This is a non-container approach, focusing on VMs. However, the theory is that this will be pretty much turnkey for a scalable self-hosted product on AWS, including software updates. It's hard to say how well the theory will hold up in practice, but I'm optimistic. Avoiding a mess of microservices was fairly important in making this kind of thing possible; we have a few services, with a dominant monolith in Elixir.


This is the first I've heard about Chef Habitat in a few years. I'm curious about why you chose it. Also, how do you configure new VMs? Do you have a custom machine image, or do you use a stock image and have a startup script (via EC2 user data) that sets it up? I've spent a little time looking at Habitat docs, and I find it unfortunate that they're trying so hard to jump on the container bandwagon when apparently it also works well on bare metal or VMs.


Our project was greenfield 2.5 years ago, and in evaluating the ecosystem we ran into the work going on with Habitat while trying to figure out if there was a VM analog to what was happening with k8s and containers (which was a very scary prospect to dive into). Habitat seems to be a good tech stack for solving a variety of operational problems for which you'd typically pull different products off the shelf and duct-tape them together -- particularly if you are focused on VMs.

It was a bit of a bumpy ride, but the product has stabilized enough that we have confidence in it, and it turns out to be quite a good fit for the self-hosting use case. Basically, it's an all-in-one solution for packaging, isolation, service discovery, configuration management, and deploys. So it was relatively easy to transition our production deployment bits into a packaged-up solution deployed to third-party servers, with all of the bells and whistles.

Our setup is a fairly minimal custom AMI that just installs the basic package dependencies for Habitat and configures the supervisor. Everything else is bootstrapped within Habitat, including our own proprietary configuration management service, which provides a nice web-based GUI for configuring the ring state. The user data script just does some basic UNIX setup and DNS initialization and then does a bunch of Habitat service loads. By abstracting all configuration through Habitat, we get a unified interface for configuring all the services across all the machines, regardless of their underlying configuration management approach, programming language, conventions, etc.


This is true to some extent. I'm building out the infrastructure for my company, and there is a lot of effort on my part required to automate stuff.

Something like kafka requires figuring out the configuration you want, putting that into Salt, adding the correct configuration options, deploying it and making sure zookeeper works, and then generating certificates and what not. It's not a simple process.

Setting up monitoring and other things like floating IPs is a pain. Then there are the custom wrappers for Terraform scripts and other components required to deploy the systems you need to run an app. It's a lot.


Ugh, this comment triggers my hatred for the current state of infrastructure. It's just configuration hell at this point. We thought all of these wrappers and services would make infrastructure easy, and they did to a certain degree, but they also created brand new problems that are arguably just as annoying as the old ones.

Edit: As a developer, I would rather sketch out the infrastructure that I need, then hand it off to someone whose entire job is to set up infrastructure. I work at a small startup where the devs are still doing a lot of infrastructure, and all it does is create technical debt. Without a solid team that owns infrastructural concerns, developers just end up digging themselves a grave.


I think the problems are mostly around configuration that is so complex and redundant that we need programmatic abstractions to deal with it—we need something like a general purpose programming language to DRY up and simplify that configuration, but the powers that be are clinging desperately to YAML (despite encoding a shitty, half-baked AST on top of it a la CloudFormation, or “generating” YAML with text templates a la Helm) as the human/configuration interface, presumably because “it’s as easy as YAML!” is such a good marketing schtick.

A certain amount of configuration complexity will always be there, but there’s still a lot of incidental complexity that could be cut away if we just generated these configs with a general purpose language.
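For a concrete (if toy) example of what I mean, here's a sketch that emits a pile of near-identical Kubernetes Deployments from one Python function instead of copy-pasted YAML; the service names and images are made up:

    # One function, many near-identical stanzas; PyYAML does the serialization.
    import yaml

    def deployment(name, image, replicas=2):
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": name},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": {"app": name}},
                "template": {
                    "metadata": {"labels": {"app": name}},
                    "spec": {"containers": [{"name": name, "image": image}]},
                },
            },
        }

    services = [("api", "registry.example.com/api:1.4"),      # hypothetical services
                ("worker", "registry.example.com/worker:1.4")]
    print(yaml.safe_dump_all(deployment(n, i) for n, i in services))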


There are worse things than YAML. Dealing with hundreds of snowflake configuration formats that do not follow common-sense rules has cost me more time than working around mistakes in YAML. There are obviously some really, really stupid ideas, like generating YAML with text templates, but the problem isn't with YAML. It's the tool that generated the YAML file.


In case you haven't seen AWS' answer: https://github.com/aws/aws-cdk


HCL2 feels like the farthest ahead right now, if you don't want to deal with straight up python or ruby.


Isn't that the point of the article? That infrastructure actually isn't easy, and you're just hating the fact that it isn't?

Isn't that sort of like someone in Sales complaining about the "state of development" and asking why they can't just press a button and have a custom app to sell, like, tomorrow?


It doesn't feel like we've come very far at all from the first time I realized I could use nginx as a forward proxy and have the code just make all service calls to localhost. That seriously reduced the complexity for the development team.

Some days I just want a load balancer that can route traffic to whatever servers are up now and then I'd just have Docker, a service registry, a load balancer and I'd be done for a while. The easy bits should be easy, and they can't even manage that.
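Something like this toy sketch is really all I'm asking for on the routing side (addresses and ports below are made up); everything beyond it is where the tooling gets heavy:

    # Probe registered backends with a TCP connect and pick one that answers.
    import random
    import socket

    BACKENDS = [("10.0.0.11", 8080), ("10.0.0.12", 8080), ("10.0.0.13", 8080)]

    def healthy(backend, timeout=0.5):
        try:
            with socket.create_connection(backend, timeout=timeout):
                return True
        except OSError:
            return False

    def pick_backend():
        live = [b for b in BACKENDS if healthy(b)]
        if not live:
            raise RuntimeError("no healthy backends")
        return random.choice(live)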


That's pretty much the experience I had with Consul and Traefik (containers orchestrated by Nomad).


Welcome to dev vs infrastructure vs ops vs devops.


You know that your job has abstracted into insanity when you use the names of products but the sentence doesn't read like it.

Kafka requires adding Salt to keep your ZooKeeper happy.

JavaScript went into this stage not too many years back. "I test my react code with mocha chai cucumber in gulp with puppeteer."


In this case, I'm not sure if it's abstraction-insanity as much as having common solutions to common problems. You know what would be an insanity of abstraction? Rolling all that yourself.

Granted, there are better and worse abstractions, most are not suitable for most use cases, and a lot of the time simpler and more centralized is actually more stable. But when you start needing high-availability coordination, you're going to be damn happy that there are things such as Consul, etcd, and ZooKeeper.

As for Salt: by all means, start off by having bash scripts in source control, which at some point becomes a pain point, and you'll realize you need some kind of configuration management to not go insane.

Several of these tools just provide a framework and a common language for tasks we have to do either way.


Isn’t this stuff supposed to be solved by products like DC/OS?


DC/OS is dead, Jim. Well, kinda. Mesosphere changed their name to D2IQ and is now supporting Kubernetes.

You can, however, install the banzaicloud kafka-operator with only a few commands and very little config and get a production-grade Kafka cluster pretty quickly.

https://github.com/banzaicloud/kafka-operator


This is why I love Heroku - it hits this really sweet spot where you can deploy apps to it fairly easily (Docker or their packaging system) and skip huge chunks of that "going to bite you eventually" list of admin tasks.


Depending on your org size, Github/Gitlab could still be cheaper than Heroku, and the software is much better than any FOSS thing you can find.


Typically you use both: GitHub for the git repo and Heroku for automatically deploying from GitHub.


This is certainly true, but it strikes me that the solution must be a move towards simpler primitives so that managing a cloud-based piece of software doesn't require so many moving parts.


I've mentally tried to spec out what that would look like. It's weird and very unlike how we operate today. It is not clear to me that anything I've mentally sketched out is a win even if it were magically manifested for me with no effort.

I don't think this is accidental complexity, it's essential complexity. To the extent that it seemed easier in the past, it's because we ignored some of that complexity and paid the price. Putting up a truly production-grade service is fundamentally hard.

Now, I think it probably will get easier over time as we grapple with these problems. Integrating with auth, for instance, should be easier, and that can be solved with some more code, and formal and informal standards. It's not all essential complexity. But I think a good deal of it is, or at least it is from anything remotely resembling our current perspective.


> I don't think this is accidental complexity, it's essential complexity.

I think this is the kind of essential complexity we faced when developing operating systems. OSes are fundamentally helpers, which don't solve the application problem but make solving it easier (a good OS makes it much easier).

So we can and should use the results we got from OS development.


I agree that the parts are needed and the complexity essential. What I mean is that we need higher-order primitives that abstract away and shield us from the complexity. Just like we don't need to spend time crafting TCP packets by hand, I don't think we should have to mess with transport protocols, write RPC calls by hand, juggle yaml files to stand up Kubernetes clusters, and so on.

I've been working on building something like this for a year now [1] and it's certainly quite different from how we do backend development normally, but that's also the point :). The tricky part is to find abstractions that are general and non-leaky so they can be leveraged to build a wide variety of software. Feedback appreciated!

[1] https://encore.dev


On the other hand Alan Kay has made powerful arguments that systems can be made much simpler if only their building blocks were slightly less simple! It's a tough question.

https://youtu.be/NdSD07U5uBs?t=319


Good point, I didn't necessarily mean simpler primitives but rather "higher-order" primitives. We need to move to a higher level of abstraction on top of a new layer of primitives (even if those primitives themselves are more complex than the ones we have today).


You still need to consider the functional aspects of the service, as well. Perhaps you're using some critical software that suddenly has a very bad CVE. It turns out that the only developer working on it was hit by a bus, and now the magical connector between your git server and your LDAP server has no functional replacement (or the replacement isn't compatible with your current setup). No amount of better abstractions will prevent a painful midnight page.

There will always be a meaningful amount of effort involved in supporting your own systems, and that cost (be it in ongoing maintenance time, or in lost time because your own system failed) _can_ always be higher than paying to outsource the work to a trusted third party.


I agree! But isn't that the beauty of abstractions? If you are using an unmaintained file system with a CVE, the beauty of the POSIX interface means you can just migrate to another file system and basically all applications will just continue working. I'm not advocating for having nobody at the wheel to notice and address security vulnerabilities, just that those concerns should be isolated from the application code to a much greater extent than they are today.


I think what is needed is a statically typed systems-integration language. Problems are caused when there is nothing to "type-check" that the values you write into config files are references to objects of the correct type.

Static typing would also support editors that let you choose from allowed values.

Maybe such a language exists? If not, why not?


This already sort-of exists in the form of Nix/NixOS/NixOps. Entire system configuration is deterministic and is specified in a configuration file, and it has its own little functional DSL called Nix that is used to specify system configs and how to build software packages.

Just started using it and I think it's probably one of the best software ensembles I've ever used in my career. Completely knocks Docker, K8S etc. out of the water.


Yes, the Nix ecosystem makes running both servers and services much easier than any other tools I have used.

I describe the deployment I run with it here: https://news.ycombinator.com/item?id=21468506

Nix is not typed (it would be slightly better if it were), but everything is evaluated before it hits your servers which allows for lots of static checks.

I started writing a tutorial on NixOps here if you want to learn it: https://github.com/nh2/nixops-tutorial (only has 1 part so far, I'd like to show how to bootstrap a Consul cluster and distributed file system next).


System integration is programming, or a kind of programming, with its own features (like limited expressive power in some places - see e.g. the Dhall language), traditions, history, etc., but still - you ought to be able to use the classical programming toolkit to solve system integration problems.

So you should ask, can you use the common languages - and if not, why?


> can you use the common languages - and if not, why?

Good question. I do not know the answer to that. But I think there must be some answer, since there are all these different configuration languages, Ant, Gradle, etc. They must have been created for a reason.


Chef at least provides a partial approach here, where components essentially define an API that you can write mock and/or integration tests for.

Ansible is not as 'data-structure' oriented, so testing it is more difficult, but each element (modules, libraries, roles, playbooks) can be written to be fairly environment-neutral, which you could then check for regressions using a test fixture like BATS.

To answer your 'type-checking' expectation, it's generally on the developer to implement those tests. In my experience, it's rare that those are implemented in publicly available devops libraries, even though we have most of the tools to do so. It's definitely an area where the practice can improve.


I have a sinking feeling that the whole business of writing huge config files (semi-)manually (templating, #include, and YAML-references that reference things from god knows where) is... too low-level? Too imperative? It feels like if one was manually laying out the vtables and the unwinding info in assembly language: sure, the macros will go a long way helping you, but maybe the better way would be to have an actual high-level declarative description that would get translated down to metal. And "high-level" doesn't mean, let's use an analogy, M4, it means Gradle.


Any statically typed, general purpose programming language would do the trick—just generate YAML. I’ve been playing with typed Python and it works really well. The only issues are that Python’s type system is pretty immature (can’t represent JSON, can’t constrain generic types to a particular interface/protocol a la Rust traits, type annotations in a class share the same namespace as member variables, etc.). I would like to try Go for this, but it would be a fair bit more imperative.
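For what it's worth, a minimal sketch of the kind of thing I mean (the schema below is made up for illustration, not any real tool's):

    # Dataclasses give the static checking (via mypy); PyYAML serializes.
    from dataclasses import dataclass, asdict, field
    from typing import List
    import yaml

    @dataclass
    class Container:
        name: str
        image: str
        ports: List[int] = field(default_factory=list)

    @dataclass
    class Service:
        name: str
        replicas: int
        containers: List[Container]

    svc = Service(name="api", replicas=3,
                  containers=[Container(name="api", image="api:1.4", ports=[8080])])
    # mypy would reject e.g. replicas="three" before any YAML is ever emitted.
    print(yaml.safe_dump(asdict(svc)))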



Like visudo, for everything! That would be cool.


Isn't it Go?


This is a fair question if you’re lucky enough not to be familiar with devops tooling.

No, it’s not Go. Go is a general purpose programming language. What the parent poster wants is more like Terraform, but better (I assume). With semantics that reflect devops actions.

You might write that tool in Go :) but the users wouldn’t necessarily know. Just like a gamer doesn’t need to know about C++.


You could use Go to generate Terraform configs. I’m using Python to generate CloudFormation and it’s an order of magnitude easier and more maintainable than vanilla CloudFormation, especially for large integrations.
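A minimal sketch of the general approach (using troposphere here purely as an illustration; the exact library matters less than building the template as data):

    # Build the template as Python objects, then serialize to JSON for CloudFormation.
    from troposphere import Output, Ref, Template
    from troposphere.s3 import Bucket

    t = Template()
    bucket = t.add_resource(Bucket("SiteBucket"))          # hypothetical resource
    t.add_output(Output("BucketName", Value=Ref(bucket)))
    print(t.to_json())                                     # feed this to CloudFormation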


Real account of just how "not trivial":

https://nickcraver.com/blog/2016/02/03/stack-overflow-a-tech...

That said, thanks to VPS and Raspberry Pi, it's fun and extremely inexpensive to privately deploy and maintain semi-reliable personal services. Anything beyond that requires immense planning, expense, and dedication.


It's a good article and basically correct, but I don't agree with the conclusion "why not pay someone else to do it". It may scratch your itch in the short term, but in the long term you have no agency if you've stuck all your fingers into the finger-trap of GitHub Actions, metadata, and other EEE. It's literally Microsoft at the wheel as well.

Similarly, free CI services are very attractive, but when they stop being free, or die, and you have a lot of investment in tests specific to their infrastructure spread across all your apps and again no agency to keep it going yourself, it's less attractive.

"Why not pay someone else to do it" glosses the privacy and security results are not the same when you pass all your email or IP to a large foreign company who may compete in some of your markets, compared to doing it in-house.

Yes it can be unexpectedly difficult to do even a small thing securely and well. But it doesn't mean that it's not the right approach.


Does anyone here have any experience with Juju? From how it's marketed, it may alleviate much of the burden of setting up infrastructure (especially if you use charms provided by the charm store).


Bonus points for the understatement in the title, which caught me at just the right time to lol.

Will skip reading comments because I know some k8s fanboy or some other type of religious enthusiast will insist all of it is solved.


Like in other business domains this is a "make or buy" decision.

To decide of course you need enough data about the situation at hand:

- initial costs

- keep system running costs

- other pro/con items like "availability" (e.g. you can use a local server even if there are issues with your company's internet access) that can be prioritized and rationalized with methods like cost-utility analysis or cost-benefit analysis

I think it is hard to give general advice on what to do in such cases.


To expand on this, not much of what's done in managing software applications is "easy". People so often talk about spinning up VMs and services at scale and forget that even running your own well-maintained blog is a big effort.


Static site generators make it pretty darn easy, though, especially if you serve it off S3. You don’t have to think about that site every year if you don’t want to, and it’s generally fine for blogs.
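Publishing is the easy part too; a rough sketch of pushing a generated site to S3 with boto3 (the bucket name and output directory are placeholders, the bucket is assumed to already be set up for static hosting, and `aws s3 sync` does the same job):

    # Walk the generator's output directory and upload each file with a guessed Content-Type.
    import mimetypes
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")
    site_root = Path("public")                              # typical generator output dir

    for path in site_root.rglob("*"):
        if path.is_file():
            content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
            s3.upload_file(
                str(path),
                "my-blog-bucket",                            # placeholder bucket name
                path.relative_to(site_root).as_posix(),      # S3 key
                ExtraArgs={"ContentType": content_type},
            )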


I get the same feeling about this as the one described in the article about git servers. Yes, static site generators make it easy to post new content written in plain text, Markdown, and other markup languages. However, it's the initial configuration that can be tough. I spent a year playing around with Hugo. Along the way, I learned a little bit about HTML, CSS, SCSS, the Go template language, shortcodes, configuration through TOML, TOML/YAML front matter, reStructuredText, Markdown, and git submodules. When my blog was Jekyll-based, I had to maintain gems that were compatible with the version of Ruby & Jekyll used by GitHub.

None of this stuff is easy when you're starting out. Even when you've become used to whatever environment you've created, there are a lot of moving parts to monitor and manage.


Yeah, I’ve found the purpose built blog generators I’ve played with to be a bit overly complicated for what they do. I’ve found Frozen-Flask to be more straightforward, and it has the advantage of being switchable back to a dynamic site fairly easily. An alternative would be wgetting Wordpress or similar.


Eh, let's not get carried away. A blog requires almost zero maintenance (assuming it's "just" a blog).

I've been running mine on a $5 Digital Ocean droplet for several years without any fuss. Maintenance boils down to logging in once a year or so and installing whatever updates are required (which I'm a few years behind on...). It's survived a few HN front page posts even with this unoptimized, non-scalable, sqlite backed ramshackle setup all without breaking a sweat.


It is quite likely your blog is serving spam in the best case and warez in the worst.


What do people use for 2FA if you only have 10-20 employees?


2FA is just something you know and something you have. I have a Yubikey with an RSA authentication key (plus encryption and signing keys too, but that's irrelevant) in it that I've hooked ssh-agent and GPG up to. It is the only key accepted by my servers. Obviously, disable password login. The key has a password, which is the something I know.


But that doesn't help if I want that 2FA key to access their mail, Gitlab instance, AWS servers, etc. What do I do when someone leaves and I want to be able to revoke it and take over their repositories and accounts?

If I'm a 1000 person company, I've got resources to spend on my corporate 2FA infrastructure.

What do I do when it's 10 people? 20 people? 50 people?


When the company I was working for was small, we had SSH 2FA via Duo Security, used G Suite with mandatory 2FA, and got as much as possible set up to do SSO via G Suite. G Suite isn't great, but there are/were a lot of hooks to log in with it, so that was nice; and these days, it has a sane way to force 2FA (when I did it, setting the org to mandatory 2FA meant your new users couldn't log in, because they hadn't set up 2FA, because they couldn't log in; thanks Google).

We self-hosted git though (using gitolite for access control); running servers was a core competency for the team, so having a little baby server on the side that just dealt with text files for 50-100 people wasn't a big deal. It was running on a Mac mini at the CEO's house until he forgot to pay his cable bill once and we couldn't push code for a day.


The idea of Google having control over my SSO is NOT appealing given their current track record.


SSO, probably involving SAML tied to a central “source of auth truth” like LDAP, G Suite, or AD/Office 365. When someone leaves, you kill them in one place and access is removed everywhere. Bonus points if you tie in a higher-layer third-party auth API offering like OneLogin/Centrify/etc. that your internal apps can use.

Edit to add costs: a small company can do this for 3-5 bucks/user/month. That kind of cost is doable for a small shop, and worth it.


> you kill them in one place and access is removed everywhere

But how do I now take over their repositories and email, for example?

> Edit to add costs: a small company can do this for 3-5 bucks/user/month. That kind of cost is doable for a small shop, and worth it.

Do you have a concrete reference? I really don't want to use a Google-based system. Microsoft could be an answer even if I reflexively cringe at that--they at least seem to be able to deal with businesses properly.

I'm really not averse to paying money for this, but it needs to be seamless enough that we can use it from the CEO to the receptionist.



