Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I work in SRE and the way you describe it would give me pause.

The first is that SRE team size primarily scales with the number of applications and level of support. It does scale with hardware but sublinearly, where number of applications usually scales super linearly. It takes a ton less effort to manage 100 instances of a single app than 1 instance of 100 separate apps (presuming SRE has any support responsibilities for the app). Talking purely in terms of hardware would make me concerned that I’m looking at an impossible task.

The second (which you probably know, but interacts with my next point) is that you never have single person SRE teams because of oncall. Three is basically the minimum, four if you want to avoid oncall burnout.

The last is that I don’t know many SREs (maybe none at all) that are well-versed enough in all the hardware disciplines to manage a footprint the size we’re talking. If each SRE is 4 racks and a minimum team size is 4, that’s 16 racks. You’d need each SRE to be comfortable enough with networking, storage, operating system, compute scheduling (k8s, VMWare, etc) to manage each of those aspects for a 16 rack system. In reality, it’s probably 3 teams, each of them needs 4 members for oncall, so a floor of like 48 racks. Depending on how many applications you run on 48 racks, it might be more SREs that split into more specialized roles (a team for databases, a team for load balancers, etc).

Numbers obviously vary by level of application support. If support ends at the compute layer with not a ton of app-specific config/features, that’s fewer folks. If you want SRE to be able to trace why a particular endpoint is slow right now, that’s more folks.



> The last is that I don’t know many SREs (maybe none at all) that are well-versed enough in all the hardware disciplines to manage a footprint the size we’re talking. If each SRE is 4 racks and a minimum team size is 4, that’s 16 racks. You’d need each SRE to be comfortable enough with networking, storage, operating system, compute scheduling (k8s, VMWare, etc) to manage each of those aspects for a 16 rack system. In reality, it’s probably 3 teams, each of them needs 4 members for oncall, so a floor of like 48 racks. Depending on how many applications you run on 48 racks, it might be more SREs that split into more specialized roles (a team for databases, a team for load balancers, etc).

That's vastly overstating it. You hit nail in the head in previous paragraphs, it's number of apps (or more generally speaking ,environments) that you manage, everything else is secondary.

And that is especially true with modern automation tools. Doubling rack count is big chunk of initial time spent moving hardware of course, but after that there is almost no difference in time spent maintaining them.

In general time per server spent will be smaller because the bigger you grow the more automation you will generally use and some tasks can be grouped together better.

Like, at previous job, server was installed manually, coz it was rare.

At my current job it's just "boot from network, pick the install option, enter the hostname, press enter". Doing whole rack (re)install would take you maybe an hour, everything else in install is automated, you write manifest for one type/role once, test it, and then it doesn't matter whether its' 2 or 20 servers.

If we grew server fleet say 5-fold, we'd hire... one extra person to a team of 3. If number of different application went 5-fold we'd probably had to triple the team size - because there is still some things that can be made more streamlined.

Tasks like "go replace failed drive" might be more common but we usually do it once a week (enough redundancy) for all servers that might've died, if we had 5x the number of servers the time would be nearly the same because getting there dominates the 30s that is needed to replace one.


Noteworthy: the number of apps isn't affected by whether the machines are in your datacenter or Amazon's.


I would call what you’re describing Datacenter Operations, with the exception of PXE boot.

You could have SRE do it, but most places don’t because you can get someone to swap a dead drive for way cheaper (it’s not really a complicated operation).

That growth of SRE teams comes from wanting reliability further up the stack. If you’re not on AWS, there’s no Aurora so someone has to be DBA to do backups, performance monitoring, configuring failovers for when a disk dies and RAID needs to rebuild, etc. Same for network, networked storage, yada yada


So your definition of SRE is anybody that works on infra?


> The first is that SRE team size primarily scales with the number of applications and level of support. It does scale with hardware but sublinearly, where number of applications usually scales super linearly. It takes a ton less effort to manage 100 instances of a single app than 1 instance of 100 separate apps (presuming SRE has any support responsibilities for the app). Talking purely in terms of hardware would make me concerned that I’m looking at an impossible task.

Never been an SRE but interact with them all the time…

My own personal experience is there is commonly a division between App SREs that look after the app layer and Infra SREs that looks after the infrastructure layer (K8S, storage, network, etc)

The App SRE role absolutely scales with the number of distinct apps. The extent to which the Infra SRE role does depends on how diverse the apps are in terms of their infrastructure demands


Yeah, that’s valid, there are a few common layouts for SRE. I would call what you’re describing a horizontal layout (each team owns a layer for all apps that use that layer).

It sort of comes back to support levels. Your Infra SRE teams stay small if either a) an app SRE team owns application specific stuff, or b) SRE just doesn’t support application specific stuff. Eg if a particular query is slow but the DB is normal, who owns root causing that? Whoever does needs headcount, whether it’s app SRE, infra SRE or the devs.


Many people assume that companies need or want global enterprise level of management of infrastructure or 24/7 support. That's simply not the case. Many small and mid-sized companies just need their applications to run. There is no CTO on the board and nobody else really cares where the stuff runs if it fits a certain budget, is available enough to not cause major disruptions and is responsive enough to not cause complaints. Some companies may care about a certain level of compliance/ security and whether their admins/ DevOps people seem to be in agony most of the time but of those there aren't many. That's also a reason why the EU introduced directives such as NIS2, DORA, CRA, CER, even the now 10 year old GDPR and more.

Most companies I have seen have never updated the BIOS of their servers, nor the firmware on their switches. Some of those have production applications on Windows XP or older and you can see VMware ESXi < 6.5 still in the wild. The same for all kinds of other systems including Oracle Linux 5.5 with some ancient Oracle DB like 10g or something, that was the case like 5 years ago but I don't think the company has migrated away completely to this day.

Any sufficiently old company will accrete systems and approaches of various vintages over time only very slowly ripping out some of those systems. Usually what happens is that parts of old systems or old workarounds will live on for decades after they have been supposedly decommissioned. I had a colleague who was using CRT monitors in 2020 with computers of similar vintage, probably with Pentium III or early Pentium IV, because he had everything set up there and it just worked for what he was doing. I don't admire it, yet that stuff works and I do respect that people don't want to replace expensive systems just because they are out of support, when they do actually work and they have people taking care of them.


Totally, but then you probably don’t want SREs. If you’re okay with 99% availability (~7 hours of downtime a month assuming 24x7 goal), you can get by with much cheaper staffing and won’t have to deal with the turnover from SREs who get bored.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: