Rare things become common at scale (2014)

AnotherGoodName · 2024-05-21T15:21:06 1716304866

This is also largely the answer to why your weekend garage project doesn't need 1000+ developers that an actual in production project needs. At scale every edge case is a bug that someone has to deal with.

Not just the hard crash software/hardware edge cases either. Regulatory edge cases, abuse vector edge cases, localization and accessibility edge cases will all need to be dealt with.

I remember people exclaiming 'musk is right, Twitter doesn't need 1000s of developers' only for Twitter to start failing a lot of regulatory and abuse vector edge cases shortly after their layoffs. It's no good judging business needs based on a basic garage implementation of Twitter. The real world and a few billion users makes for an entirely different set of problems.

nsguy · 2024-05-21T18:23:15 1716315795

Your garage project also doesn't need to support 100's of random features some product manager thought are critical but nobody uses. Quality issues are exponential to the complexity of the software. Out of those 1000's of engineers there will also lot of variability in terms of productivity and quality. Some engineers are likely doing most of the heavy lifting. Some spend all their time in meetings. Some engineers might be working on their novel or startup while sitting in the office. Extremely unlikely they're all operating at the same level. With 1000's of engineers there's a lot of effort synchronizing those engineers, communication overhead, people stepping over each other's work etc. The mythical tower of babel: https://en.wikipedia.org/wiki/Tower_of_Babel

Lots of reliable/production software was built by teams much smaller than 1000's. Things like operating systems (e.g. the Linux kernel), compilers, tools or libraries, come to mind. Some even by single people.

That's not to say that all weekend garage projects are better than what 1000+ developer teams produce but don't knock the ability of small teams to do amazing things. Twitter isn't exactly the pinnacle of engineering accomplishment either. I'm pretty sure Twitter suffered from bloat similar to many other large tech companies. Elon taking a sledgehammer to that is probably not the best approach though. That said some people were saying the whole thing will fall apart in days, and it didn't.

[EDITed for typo]

compiler-guy · 2024-05-21T16:11:36 1716307896

Such scale also creates opportunities.

If you have 10 servers, a developer can't spend a couple of months to get a 0.1% performance boost--the benefit will never cover the cost. If you have 100,000 servers, it just might. If you have 1,000,000 servers, it almost certainly is worth an entire team looking for that much performance, year in and year out.

This is true of performance, avoiding downtime, or whatever else. You could never justify such an expense at a startup, but with sufficient scale they pay for themselves many times over.

closeparen · 2024-05-21T17:12:23 1716311543

A large company's million servers are going to be partitioned across thousands of separate applications. The biggest ones do benefit from performance work, but a large portion of the server count is going to be a long tail of services that aren't individually worth that much effort.

vlovich123 · 2024-05-21T17:57:02 1716314222

They would rarely be partitioned because of all the waste - workloads are typically placed cotenant into 1 machine (VMs, containers, normal multi-processing, etc). But I believe you’re right that cumulatively there would be a lot of long tail services that consume resources and the cumulative inefficiencies aren’t being optimized because it’s work that doesn’t scale (i.e. you’d have to fix too many performance issues on services no one cares to make a visible overall dent). Not sure why you’re downvoted though - it’s an astute observation.

pessimizer · 2024-05-21T16:49:49 1716310189

> I remember people exclaiming 'musk is right, Twitter doesn't need 1000s of developers' only for Twitter to start failing a lot of regulatory and abuse vector edge cases shortly after their layoffs. It's no good judging business needs based on a basic garage implementation of Twitter. The real world and a few billion users makes for an entirely different set of problems.

It's strange this is the lesson you took from Twitter. Twitter fired 90% of their developers, and almost nothing changed. Twitter is a site that exists in the real world, at scale. People don't even talk about the fact that Twitter fired 90% of their devs anymore, they just began firing devs themselves.

MrDarcy · 2024-05-21T17:14:30 1716311670

Elon Musk fired Twitter employees. Have you not noticed how the site has deteriorated technically? I personally know people who have suffered abuse on Twitter, reported the abuse, and had no action taken because Elon fired the trust and safety org.

4hg4ufxhy · 2024-05-21T17:34:34 1716312874

Do you also know personally people who reported it and had action taken prior?

MrDarcy · 2024-05-21T20:58:50 1716325130

thfuran · 2024-05-21T17:50:54 1716313854

Even if it did, that's not really evidence that it'd take nearly as many people to run a Twitter clone at Twitter scale. It may well, but suddenly losing 90% of personnel is some serious bus factor woes.

canoebuilder · 2024-05-21T17:54:01 1716314041

If reading something is hurting your feelings, you can stop reading it.

Twitter even provides mute, block, and whatnot functionality to prevent specified things from even showing up in your line of sight to begin with. And if the app is really bothering you, you can always set it down and go outside, take a walk, meet somebody new, do something that will put a smile on your face on your deathbed.

Lumping in mean comments online, with actual abuse is approaching risible. Words have meanings, we shouldn’t dilute or distort them.

By Twitter “not taking action,” sounds like your friend is upset that he or she can no longer co-opt the proprietors of the site into enacting punitive measures on people who draw his or her ire.

Maybe some mean things were said or whatever, but at the end of the day it’s just text on a screen isn’t it? And there’s a lot more to life than text on a screen, isn’t there?

It’s also weird how you mention the technical functioning of the site, then bring up the “Trust & Safety Org” when the legacy of “Trust & Safety” is a small cabal with extremist views arbitrarily deciding what information to censor and suppress based on their own viewpoints, whims, and influence from government agencies.

That has nothing to do with the technical functioning of the site which is a matter of reproducible, specifiable, determinate functions implemented in computer code to produce a useful product. The kind of thing that really turns the mind of an autist on.

P.S. Not to be too blasé about your friend, mean words can be an issue, especially an ongoing pattern, but anonymous strangers online seems like less of an issue than irl, and was this really an issue where block or mute wasn’t sufficient? How so?

petsfed · 2024-05-21T18:38:28 1716316708

Man, if you can't see the difference between "so-and-so called me a mean name" and "1000 strangers all knocked on my door just to tell me, in excruciating detail, how they wish my children were raped and murdered", I don't know what to tell you.

X's systems for block and mute require the abuse to occur before you have an avenue to respond. Considering that all you need to get an X account is an email account, it's a pretty low bar for brigading. And that's to say nothing about organized campaigns to falsely report an account for abuse.

For individuals, I suppose you can make some kind of argument that those tools are sufficient, but if you're the poor social media manager for some township or minor government agency that draws the ire of the internet hate machine, you have to deal with all the abuse that goes with it. You are barred by the constitution from blocking people (and rightly so), and you have no real power to prevent them from creating sock puppet accounts to continue the abuse. PTSD is pretty common amongst (former, since they fired them all) twitter content moderators, because being consistently exposed to that stuff can eventually be pretty traumatizing.

Atotalnoob · 2024-05-22T19:34:32 1716406472

“Mean words” are a small part of what trust and safety does.

CSAM, beheadings, videos of the worst things imaginable are what trust and safety deal with on a daily basis.

brabel · 2024-05-21T16:19:03 1716308343

Do you have more information about the Twitter abuse vector edge cases?

And how would developers mitigate that? Isn't that the kind of thing you need human content reviewers to handle (assuming any automated tool Twitter had was still there after the layoffs)?

AnotherGoodName · 2024-05-21T16:31:39 1716309099

The development work comes in stopping repeated abuse. Eg. A block of up addresses known to be bad should not be able to repeatedly sign up for new accounts after a ban. You may need to add catchas for users which have certain criteria met. Call pumping needs to be constantly re-scripted to stop bad phone companies tricking your systems into sending SMS to expensive numbers (Twitter turned off all SMS two factor auth to stop this which is telling of the gap they left here).

closeparen · 2024-05-21T17:18:59 1716311939

Rules engines and ML models to escalate interesting Tweets, accounts, and networks of related accounts for human review. Data pipelines to feed them. Change management and observability so that analysts can safely keep them up to speed with threats. Case management systems to put the findings into. Reviews, approvals, deviations, escalations, M of N reviewer consensus. Each one of these things is at least as complex as Twitter's core product.

petsfed · 2024-05-21T16:53:00 1716310380

I feel like this is a huge digression, but this is basically the perfect counter-argument to "its never been illegal to write down license plate numbers/stake out people's houses/write down the addresses of mail delivered to a home/etc, why should we make new rules now that AI is doing it?"

Obviously manually interceding whenever you have a server failure is unsustainable when you have 1000 (or more servers). Why do people somehow believe that manually interceding to prevent bad action with surveillance is somehow sustainable, when previously the rate limiter was that the surveillance itself was manual?

janalsncm · 2024-05-21T17:43:23 1716313403

Definitely. There are emergent properties of technology when applied at scale that make something that was just weird and annoying before into something truly dangerous.

There are a lot of things that are unintuitive at larger scales. AI and intellectual property is another.

And I have sympathy for people who said that algorithmically amplified speech on large platforms is qualitatively different from regular town square speech. It is. I don’t know how the laws should change in response but to pretend that Twitter is just like a bar is pretty naive imo.

Gunax · 2024-05-21T21:51:11 1716328271

I appreciate your comment because I think I am usually on the other side.

petsfed · 2024-05-21T22:22:19 1716330139

Considering that I'm usually on the side of "its already illegal, making it more illegal won't help", I've struggled for a long time to express what exactly was the problem with e.g. automated surveillance, because I couldn't express it in a way that was convincing even to myself.

d1sxeyes · 2024-05-21T16:42:33 1716309753

While the sentiment is lovely, the conclusions ignore business reality. I did some back-of-an-envelope calculations.

A CS rep who “truly cares” is gonna set you back around 50K[0] in salary, call it 75K total cost to employ. I don’t know what their average customer value is, but seems they start doing phone support at 40USD, and 24x7 chat support for all customers. Let’s be generous and assume 50USD/customer average.

That means it takes at least 125 average customers to pay a CS rep’s salary.

Now bear in mind it’s 24x7, so you need at least 8 CS reps. That means you need to be retaining 1000 customers per year just to break even on your CS team. That’s around 20 a week.

The best customer support is the one your users never have to contact. If you can automate fixes (bonus points for automated pro-rated credits on bills based on downtime impact), or improve reliability, the customers with the highest value to you will stick with you.

That’s not to say “give a shitty support experience because it doesn’t matter”, it’s just that it’s a solution to a different problem than the one presented here.

petsfed · 2024-05-21T17:04:59 1716311099

I think your math is not quite right here.

You can probably get away with 1-2 good CS reps, provided the lower tiers have the right tools to triage things effectively. Put another way, if you have 1 CS rep that really cares, that does the followup calls, then you can employ 7-8 front-line CS reps at half the cost who just take diligent notes, and handle the common errors that customers make. I sincerely doubt that the person I have to call to help me connect my cable modem to my ISP makes even 40K a year. But all they have to do is follow a script, write down my answers to their questions, and escalate if the script calls for it.

I agree, completely, however, that ideally, every real issue happens precisely once and results in a change in the automated failure response.

d1sxeyes · 2024-05-21T17:44:24 1716313464

> I sincerely doubt that the person I have to call to help me connect my cable modem to my ISP makes even 40K a year.

That’s sort of my point… do you feel as though that rep really cares? Is that an outstanding customer support experience for you? Many folks would say ISPs and telcos have the worst customer service.

You’re not wrong though that a good script can help a lot of users with simple issues, and as it turns out, that’s the kind of thing that’s actually pretty easy and effective to automate.

petsfed · 2024-05-21T18:20:11 1716315611

It was actually one of the best Telco CS experiences I've ever had, and I think its because the gist of the call was "here are the numbers from my modem, so that you can authorize it to talk to your network, do you see it? Yes? Ok, we're done now".

The main reason bad customer service happens is because the customer has a problem that the CS rep is not empowered to fix or escalate the problem. Its not a question of "caring", its a question of results. Its really hard to care about your job when faced with unrealistic expectations from the customer, and insufficient resources from the employer. A lot of folks with customer service departments could save a bundle on labor and earn a lot of goodwill if they would recognize that.

deathanatos · 2024-05-21T19:20:03 1716319203

> then you can employ 7-8 front-line CS reps at half the cost who just take diligent notes

… the actual reality is that companies employ 7-8 front-line CS reps that can't even read the ticket, and end up asking questions already asked & answered by the form I was required to fill out during the ticket!

I'm not buying upthreads' math either, though. Either the support plan is paid, or not. In the case of paid support plans. AFAICT given the support vs. the pricing for plans I've been a part of, a customer is just profitable. I don't get a full time, 40h/wk support agent, I only get them for the duration of time they're attending my tickets, and that proportional yearly salary cost is < the support plan's price in the cases I've seen; support plans are just ludicrously expensive, but so many companies feel they're obligated to have them, whether for DR, compliance, or whatever, that they don't question it. There are some companies that just do support as part of the included purchase (this is the right way, IMO), in which case, yeah, it takes a few customers. But in this case, how many customers/support rep is dependent on how much you can or are spending on operating costs.

> I agree, completely, however, that ideally, every real issue happens precisely once and results in a change in the automated failure response.

Today's customer support zeitgeist is utterly opposed to this idea. I agree that it's the correct response, but support's goal — that is, the metric which they are apparently measured by, in nearly every case I've seen is, is to close the ticket as fast as possible. This means not waiting around for permanent fixes: if a problem can be kludged around & the ticket closed faster, that wins. But then the extra due diligence of leaning on the engineering side becomes optional, extra … and just doesn't happen.

I've literally had tickets closed because "there hasn't been a response on this ticket for some time" — with the ticket plainly in the provider's court — and "we don't want to waste time your time with long running tickets" — nor a fix to my problem, clearly.

Customer Support is Goodhart's Law.

The latest change is that now I have to fight an LLM that cannot address my problem to get to that front-line rep who won't read my problem. Yay, progress! /s

petsfed · 2024-05-21T19:50:52 1716321052

>Today's customer support zeitgeist is utterly opposed to this idea. I agree that it's the correct response, but support's goal — that is, the metric which they are apparently measured by, in nearly every case I've seen is, is to close the ticket as fast as possible.

First, I can't believe its taken me this long to start italicizing when I quote somebody. I'm going to do that for all future quotations.

Second, I can't easily imagine a better illustration of Goodhart's law than this. I can certainly see how they got to "average time to close", even though obviously the important metric is "how many customers eventually reached a satisfactory conclusion". Its just that its hard to answer that without bugging a bunch of people, reminding them of that time when the product shit the bed and they had to call support.

And "time to close" is not a bad metric, but it can't be the only metric. This reminds me of something that seems like a corollary of Goodhart's law: the easier it is to measure something, the less likely it is to be useful as a metric.

d1sxeyes · 2024-05-22T07:50:59 1716364259

Support is not paid exactly if I understand correctly. According to the pricing, even the lowest tier includes '24/7 WordPress technical expertise', which includes chat support. The second tier includes phone support.

I don't even see any upsell to 'premium support' or anything like that for retail users. However, probably enterprise customers do indeed get separate pricing for support.

All told though, I'm not really saying that my maths was supposed to be on the money, just that the costs of running a CS team like the article suggests are high, and having an expensive CS team is only justified if they drive enough revenue, either through attracting new customers or retaining existing ones.

And to be clear, having outstanding CS is indeed a differentiator. But it's not a differentiator compared to reliability/automation, it's a differentiator compared to other companies' CS teams.

eszed · 2024-05-21T19:38:23 1716320303

I agree with the sibling posts about escalation and so forth, but your numbers also don't capture customer recruitment. I'm fortunate enough to live within Sonic's catchment area in the Bay Area, and have just moved the company for which I'm responsible over to 100% Sonic. My advocacy for them is 85% based on their superlative tech support, and our new business account will pay for half of one of those "truly cares" engineers. I'm damn happy to do it. Incidentally, I think I've persuaded three or four neighbors to switch to (residential) Sonic, and one professional contact to set up a business account for his company.

I think good customer support/relations has to be committed to for reasons that don't immediately show up on spreadsheets, with the trust that they will pay long-term dividends. I realize that is antithetical to The Way Things Work Now, but in both my personal and professional lives I try only to do business with companies that agree.

bytearray · 2024-05-21T16:56:39 1716310599

I had a similar realization when I was at Google. Someone made the observation that if Google had a one-in-a-million error, that meant it was happening 50-60 times per second.

mhb · 2024-05-21T18:32:27 1716316347

And if you're one in a million, there are 8,000 people just like you.

mjb · 2024-05-21T16:23:56 1716308636

> And you can’t “automate” away the rare things, even the technical ones. By their nature they’re difficult to define, hence difficult to monitor, and difficult to repair without the forensic skills of a human engineer.

There are two ways to think about automating response to technical problems: a reactive way and a proactive way. Reactive automation looks to diagnose, repair, or work around system faults as they happen, within the constraints of the design of the system. The proactive approach happens early, at design or architecture time, in designing and building the system in such a way that it is resilient to rare failures and designed to be automatically fixed.

For example, think about a standard primary-failover database system with asynchronous replication. Reactive automation would monitor the primary, ensure it stayed down once it was deemed unhealthy, promote the secondary, provision a new secondary, and handle the small window of data loss. This works great for "fail stop" failures of the primary, and can often recover systems within seconds. Where it becomes much more difficult is when the primary is just slower than usual, or if there's packet loss, or if it's not clear whether it's going to come back up. Just the step of "make sure this primary doesn't come up thinking its still the primary" can be very tricky.

A more proactive approach would look at those hard problems, and try and prevent them by design. Could we design the replication protocol in such a way that the old primary is automatically fenced out? Could we balance load in a way that slower servers automatically get less traffic? Should we prefer synchronous replication or even active-active to avoid the hard question of what to do with lost data or slow primaries?

Looking through a certain lens, the answers to all those questions just add complexity to something simple. But, from another perspective, they take the hardest thing (weird at-scale failures) and turn them into something much simpler to handle. It's an explicit trade-off between system simplicity and operational simplicity. Folks without experience running actual systems tend to have poor intuition about this trade-off. The only way I've seen to help people build good intuition, and make the right decisions, is to get the same folks who are building large-scale systems deeply involved in the operations of those systems.

cortesoft · 2024-05-21T16:47:44 1716310064

This is one of my favorite parts about working for a CDN with tens of thousands of servers around the world… the challenge of designing systems that will encounter rare events all the time.

Race conditions just become conditions, because they will happen all the time. It has never been more important to know which syscalls are actually atomic and under what conditions that holds true.

base698 · 2024-05-24T17:44:39 1716572679

Would you say it's a canary in a coalmine?

AndrewKemendo · 2024-05-21T19:31:03 1716319863

> But some things you can’t automate. You can’t “automate” a knowledgable, friendly customer support team. You can’t “automate” responding to a complaint on social media. You can’t “automate” the recruiting, training, rapport, culture, and downright caring of teams of human beings who are awake 24/7/365, with skills ranging from multi-tasking on support chat to communicating clearly and professionally over the phone to logging into servers and identifying and fixing issues as fast as (humanly?) possible.

From what I can tell of the last 20 years of corporate strategy, they just decide to not do these things instead of automating them.

To the extent where I am unaware of a customer service process that is not a Kafkaesque call-tree at this point

everdrive · 2024-05-21T17:52:49 1716313969

This is also why the internet breeds extremism. The rarest and most extreme ideas feel much more common than they ever did. They propagate differently than they ever used to. And nearly any political cause now fails a "no true Scotsman" check.

thefaux · 2024-05-21T16:13:48 1716308028

On the technical side, this is a case for avoiding languages with exceptions and/or languages that allow silent error dropping. That unhandled exception is the rare event that is actually common.

I will say though that I am a curious what they mean by fatal server error. I once worked at a well known company with about 400 production servers that were all running near capacity. I cannot remember a single serious hardware problem that truly killed a server (we did failover often to a backup but that was usually for software or upstream infra reasons rather than a hardware failure). I understand the scale is lower than in the article, but a server failure every day with a fleet of 2000 server feels like a lot to me.

At any rate, assuming they were just spitballing on a number, the point stands that you need to design and plan for failure even if it is rare. You really don't want the one time the server fails to be the time the CEO is demoing your product to your highest value customer.

invalidptr · 2024-05-21T16:19:44 1716308384

I routinely see hardware failures in a fleet about an order of magnitude larger than the article. It's often enough that we have to plan for and recognize it, but not often enough that we have fully automated handling every edge case.

gmarx · 2024-05-21T18:52:42 1716317562

This is well known in medicine except it often works in the opposite direction, i.e. it tricks people into way overestimating the chance that something rare will happen. So yes, across a population, some children will die of COVID, but your child isn't going to die of COVID.

OTOH it often works against you in the forward direction, e.g. tricking people into thinking we should screen everyone for everything all the time, because they don't appreciate how many incremental adverse events this will result in at scale (vs. the improvements in morbidity/mortality stats at scale)

drewcoo · 2024-05-21T21:03:22 1716325402

Do you think it works that way in medicine or in public health?

Medicine is reactive. It is individually-focused. Something goes wrong, so we visit the doctor.

Public health is proactive but possibly based on bad predictions. It deals with populations instead of individuals. Concepts like "herd immunity" are from public health.

These approaches are oppositional and perhaps complementary.

https://www.ajpmonline.org/article/S0749-3797(11)00514-9/ful...

gmarx · 2024-05-22T22:58:28 1716418708

I don't understand? Which way? Making people overestimate or underestimate risk? I think it works both ways in medicine and public health. In public health (as evidenced by the pandemic) people are still vulnerable to acting on small absolute numbers because they don't see how they compare to the population. Likewise they might ignore a factor that seems like a low percentage thing because they don't get how this translates into real absolute numbers.

pfdietz · 2024-05-21T18:59:55 1716317995

Another possible example of this was the Toyota unintended acceleration problem.

Put millions of cars on the road and instances of unlikely bitflips in memory will occur. The engine software apparently wasn't resilient enough to handle this.

olooney · 2024-05-21T17:13:40 1716311620

"In our business, one in a million is next Tuesday." -Gordon Letwin

SideburnsOfDoom · 2024-05-21T15:48:41 1716306521

Or, as someone said a while back "if you process a million items a day, then one-in-a-million occurrences will happen about daily."

philipov · 2024-05-21T17:25:55 1716312355

They don't become common, they become numerous. Totally different.

lcnPylGDnU4H9OF · 2024-05-21T19:06:40 1716318400

It's common to someone. If $THING happens once every day then it's common even if it's also avoided millions of times a day. If $THING needs to be dealt with by the same person every time, they're going to think it happens commonly since they have to deal with it every day.

JonnyBadFox · 2024-05-22T08:14:33 1716365673

This is interesting, because it also applies to mass-society.

kazinator · 2024-05-22T20:26:50 1716409610

In mature industries, they use the occurrences at scale to reduce the failures, or optimize cost or other factors.

E.g. a certain car part fails so many times out of a million. Okay, change something in the design. Hey look, now it's failing 1/3rd as many times, and it's a bit cheaper. Rinse, lather, ...