For the purposes of this exercise presume that our theoretical on-call process is no worse than Google's SRE structure: You are on-call for a 12 hour shift that is more or less aligned with your waking hours, and you are compensated extra for the time you are on-call outside of normal working hours, whether or not you are called in. You are on-call at most one week per month, on average, and usually less.
> You are on-call for a 12 hour shift that is more or less aligned with your waking hours
I suppose if you're Google you can theoretically make it so it's more aligned with your waking hours? Do they actually do it? Most companies don't or can't. I.e. it's _less_ aligned.
> you are compensated extra for the time you are on-call outside of normal working hours, whether or not you are called in
How much? In way too many on-call processes this is nothing but a few dollars, just so the company can say "see, we do pay for this, even when you're not called!". That is nowhere near enough for what being on-call does to how you go about your day: always on edge, always awaiting that call or alert that requires you to drop whatever you are currently doing, and prevented from actually doing or starting certain things.
You haven't even mentioned the expected reaction and resolution times, and those alone can make a huge difference.
> You are on-call at most one week per month, on average, and usually less.
Great, "only" one week out of four /s. That's crazy if you ask me. Going back to how it prevents you from going about your day in a normal way: there's no "doing on-call well" in the setup you describe.
Google staffs SRE teams as either a single team of 8 in one location/TZ or two geographically distributed teams of 6 -- often some pairwise combination of the U.S., Europe, and Australia to accommodate reasonable on-call shifts.
The on-call compensation varies depending on what tier of service the team is offering. Tier 1 (5-minute response time) pays 2/3 of your effective hourly pay for on-call time outside of local business hours, and tier 2 (30-minute response time) pays 1/3. Or time off in lieu. (Rough numbers are sketched at the end of this comment.)
Note that this is a minimum; I know some teams with 10-12 folks per location. That has its own downsides, though, since you can end up on-call only once a quarter, which most people in the role don't like, since the extra vacation in lieu is nice to have.
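To make the tiers concrete, here's a back-of-the-envelope sketch. Only the 2/3 and 1/3 multipliers come from the description above; the hourly rate and the split of hours outside business hours are made-up assumptions.

```python
# Hypothetical numbers: only the tier multipliers (2/3 and 1/3) come from the
# comment above; the hourly rate and the hours are assumptions for illustration.
EFFECTIVE_HOURLY_PAY = 75.0             # assumed effective hourly pay
TIER_MULTIPLIER = {1: 2 / 3, 2: 1 / 3}  # tier 1: 5 min response, tier 2: 30 min

def oncall_pay(tier: int, hours_outside_business: float) -> float:
    """Extra pay for on-call hours that fall outside local business hours."""
    return EFFECTIVE_HOURLY_PAY * TIER_MULTIPLIER[tier] * hours_outside_business

# Assume a 7-day on-call week of 12-hour shifts: ~4 of the 12 hours fall
# outside business hours on weekdays, all 12 do on the weekend.
hours = 4 * 5 + 12 * 2  # 44 hours outside business hours for the week
for tier in (1, 2):
    print(f"Tier {tier}: ~${oncall_pay(tier, hours):,.0f} extra for the week")
```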
It often isn't about making things go faster for an individual user (oftentimes the driving factor of latency is not computation but inter-system RPC latency, etc.). The value is in bin-packing more request processing into the same bucket of CPU.
That can have latency wins, but it may not in a lot of contexts.
But it is a function of what you believe the future will be (and your risk tolerance).
If you have a higher risk tolerance, you will buy fewer futures. If you believe the next year will be drier than normal, you will buy more futures than normal. If you believe your crop is likely to be better/more reliable than normal, you will buy fewer futures.
> If you believe the next year will be drier than normal, you will buy more futures than normal.
The point is that you, the farmer, don't need to take a view on whether the next year will be drier than normal. You just buy $X worth of rainfall futures.
The same way you shouldn't buy more flood insurance if you think the next year will be exceptionally wet. You can't really predict that, after all. You should buy flood insurance roughly up to the value of restoring your house after a flood, and you should hope the insurance market is healthy enough that the cheapest provider of that insurance offers you a price that reflects the expected value of the insurance plus a small markup.
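To make the "no view needed" point concrete, here's a toy simulation. Every number in it (the rainfall distribution, the revenue model, the contract terms) is a made-up assumption, not real market data; the contract is priced at roughly its expected payout plus a small markup, as described above. The hedge narrows the spread of outcomes without the farmer forecasting anything.

```python
import random
import statistics

# Toy hedge: all numbers below are assumptions for illustration only.
STRIKE_MM = 500             # contract pays on seasonal rainfall below this
PAYOUT_PER_MM = 100         # dollars per mm of shortfall, per contract
PRICE_PER_CONTRACT = 4_000  # roughly expected payout plus a small markup (assumed)

def crop_revenue(rain_mm: float) -> float:
    """Toy model: revenue rises with rainfall up to a cap."""
    return 300 * min(max(rain_mm, 0), 600)

def futures_payoff(rain_mm: float, contracts: int) -> float:
    shortfall = max(STRIKE_MM - rain_mm, 0)
    return contracts * (shortfall * PAYOUT_PER_MM - PRICE_PER_CONTRACT)

def simulate(contracts: int, n: int = 50_000):
    totals = [
        crop_revenue(rain) + futures_payoff(rain, contracts)
        for rain in (random.gauss(520, 120) for _ in range(n))
    ]
    return statistics.mean(totals), statistics.pstdev(totals)

# The farmer doesn't need a drought forecast: hedging trades a little expected
# revenue (the markup) for a much narrower spread of outcomes.
for contracts in (0, 1, 2, 3):
    mean, stdev = simulate(contracts)
    print(f"{contracts} contracts: mean ~ {mean:>9,.0f}  stdev ~ {stdev:>8,.0f}")
```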
> The point is that you, the farmer, don't need to take a view on whether the next year will be drier than normal. You just buy $X worth of rainfall futures.
And I'll reiterate, this is a function of your risk aversion/efficiency. One would expect, for example, climate change to increase the price of weather futures as extreme/problematic weather events become more likely. It's often difficult to see the impact of these changes at the scale of a single farmer, but in aggregate lots of farmers make a market.
> You should buy flood insurance roughly up to the value of restoring your house after a flood, and you should hope the insurance market is healthy enough that the cheapest provider of that insurance offers you a price that reflects the expected value of the insurance plus a small markup.
And the insurance companies have a small army of actuaries who make sure that the prices they quote take into account the relevant risk factors of where your home is. This is in contrast to a betting-market-style setup, where you could imagine every individual actuary as a potential insurer.
> The point is that you, the farmer, don't need to take a view on whether the next year will be drier than normal. You just buy $X worth of rainfall futures.
Sure, but if I, a non-farmer market player that couldn't give two fucks what the market is even about, can predict that the next year will be drier than normal, and to what degree, better than anyone, I can make money buying up however many of these futures I can afford. It works even better if I can actually make the weather drier somehow.
This, I believe, is called "providing liquidity to the market", but curiously, if I tried that with flood insurance, I'd just be guilty of insurance fraud.
It also wasn't, as far as I know, ever strictly enforced. There were folks when I joined (which was when the L5 requirement still existed, but it was on its way out) who had been L4 for like a decade.
Right, while there was a "growth" expectation for L4s written into the SWE job ladder, there were no fixed timelines. Enforcement varied from org to org: at least one of my previous orgs periodically conducted talent reviews, specifically looking at cases like long-tenured L4s to decide whether to intervene.
That was before the layoffs started. One of my by-then ex-reports, a very talented and knowledgeable but not at all career-focused long-time L4, got laid off in one of the rounds. :(
If you aren't using a monorepo, you need some versioning process, as well as procedural systems in place to ensure that everyone's dependencies stay reasonably up to date. Otherwise you end up deferring pain in really unwanted ways and requiring sudden, unwanted upgrades through API incompatibility due to external pressure.
This also has the downside of allowing API-owning teams to make changes willy-nilly and break backwards compatibility, because they can just do it behind SemVer, and then clients of the API need to own the process of migrating to the new version (a minimal sketch of this follows after this comment).
A monorepo fixes both of these: you cannot get out of sync, so it is the API-owning team's responsibility to upgrade clients, since they can't break the API otherwise. Similarly, you get a versioning process for free, and clients can never be using out-of-date or out-of-support versions of a dependency.
Services work approximately the same either way, since you can't assume synchronous upgrades across service/rpc boundaries anyway.
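As a minimal sketch of the SemVer point (the package name and versions are made up; the `packaging` library is only used to evaluate the range): a client pinned to a 1.x range simply never picks up the breaking 2.0, so the migration sits with the client until external pressure forces it.

```python
# Made-up package and versions; `packaging` just evaluates the version range.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

constraint = SpecifierSet(">=1.4,<2.0")  # "any 1.x release from 1.4 onward"

for release in ("1.4.0", "1.9.3", "2.0.0"):
    if Version(release) in constraint:
        print(f"somelib {release}: picked up automatically")
    else:
        print(f"somelib {release}: breaking release, client migrates on its own schedule")
```

In a monorepo there's no range to hide behind: the 2.0-style change and every caller update have to land together.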
The scale of cloud data centres reflects the scale of their customer base, not the size of the basket for each individual customer.
Larger data centres actually improve availability through several mechanisms: more power components such as generators mean the failure of any one costs just a few percent of capacity instead of causing a total blackout. You can also partition core infrastructure like routers and power rails into more fault domains and update domains.
Some large clouds have two update domains and five fault domains on top of three zones that are more than 10km apart. You can't beat ~30 individual partitions with your own data centres at a reasonable cost!
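The arithmetic behind the "~30 individual partitions" figure, plus the blast-radius point about power components (the zone/domain counts are the ones quoted above; the generator counts are purely illustrative):

```python
# Counts quoted above; their product gives the ~30 independent partitions.
zones, fault_domains, update_domains = 3, 5, 2
print(f"partitions: {zones * fault_domains * update_domains}")  # 30

# Illustrative only: more generators shrink the share of capacity lost
# when any single one fails.
for generators in (2, 8, 32):
    print(f"{generators} generators -> one failure costs {100 / generators:.1f}% of capacity")
```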
I provided three different references. Despite the massive downvotes on my comment, I guess by Google engineers who took it as trolling... :-) I take comfort in the fact that nobody was able to put forward a reference to prove me wrong.
It is true that the nomenclature "AWS Availability Zone" has a different meaning than "GCP Zone" when discussing the physical separation between zones within the same region.
It's unclear why this is inherently a bad thing, as long as the same overall level of reliability is achieved.
The phrase "as long as the same overall level of reliability is achieved" is logically flawed when discussing physically co-located vs. geographically separated infrastructure.
In my experience, the set of issues that would affect two buildings close to each other but not two buildings a mile apart is vanishingly small: usually just last-mile fiber cuts or power issues (which are rare, and mitigated by having multiple independent providers), as well as building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).
Everything else is done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last mile fiber or power cuts, inclement weather, regional power starvation, etc.)
There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?
I think you're underestimating the seriousness of a physical event like a fire. Even if the likelihood of these things is "vanishingly small", the impact is so large that it more than offsets the low probability. Taking the OVH data center fire as an example, multiple companies completely lost their data and are effectively dead now. When you're talking about a company-ending event, many people would consider even just two examples per decade a completely unacceptable failure rate (some back-of-the-envelope math at the end of this comment). And it's more than just fires: we're also talking about tornadoes, floods, hurricanes, terrorist attacks, etc.
Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.
There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.
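A rough sketch of the expected-loss framing mentioned above. Every number here is an assumption picked for illustration; the point is the shape of the comparison, not the specific figures.

```python
# All assumptions: event rate, business value, and multi-region cost are made up.
P_SITE_LOSS_PER_YEAR = 2 / (10 * 300)  # ~2 destructive events per decade across
                                       # an assumed ~300 large facilities
COMPANY_ENDING_LOSS = 50_000_000       # assumed value of the business
MULTI_REGION_EXTRA_COST = 200_000      # assumed yearly cost of a second region

expected_annual_loss = P_SITE_LOSS_PER_YEAR * COMPANY_ENDING_LOSS
print(f"expected annual loss with single-region DR: ~${expected_annual_loss:,.0f}")
print(f"extra annual cost of multi-region DR:       ~${MULTI_REGION_EXTRA_COST:,.0f}")

# Depending on the numbers you plug in, the pure expected-value comparison can
# go either way; the stronger argument above is that a single realized event is
# unrecoverable, so many treat it as a tail risk to eliminate, not to average.
```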
One of the vanishingly small set of issues I mentioned.
It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.
Like, all cloud providers still have regional outages.
That concept is useful when you operate enough things that the failure rate makes such events expected. But we clearly don't have that here, because even at scale these events aren't common. Like I said, there have been fewer than a handful across all cloud providers over a decade.
Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.
It is true that every cloud provider uses some edge/colo infra, but it is also not true that most (or even really any relevant) processing happens in those colo/edge locations.
And limiting to just outside the US, both AWS and Google have more than ten wholly owned campuses each, and on top of that there is edge/colo space.
Only if the developer is being judged on the thing. If the tool is being judged on the thing, it's much less relevant.
That is, I, personally, am not measured on how much AI generated code I create, and while the number is non-zero, I can't tell you what it is because I don't care and don't have any incentive to care. And I'm someone who is personally fairly bearish on the value of LLM-based codegen/autocomplete.
Thus, it's not clear that any harm was caused: the right wasn't clearly enshrined, and had the police known that it was, they likely would have followed the correct process. There was no intention to violate rights, and no advantage was gained from even the inadvertent violation of rights. But the process is updated for the future.