Can't wait to find out what happened here. This seems to be a massive outage. Interesting how fragile things become when so much technology is concentrated into just a few companies.


My subjective impression as a web user since the late 90s is that things now break relatively rarely (I think it's the first time I've had any such issue with DDG, for instance), but when they do, a huge chunk of the web becomes unreachable.

Back when things were more decentralized, individual websites and services would have issues much more regularly because the individual software and hardware stacks weren't as robust and fault-tolerant, but the problem was usually limited to a single website/service.


And America only uses the term “Too Big Toto Fail” when it comes to banks.


> And America only uses the term “Too Big Toto Fail” when it comes to banks.

And only when they're not in Kansas, anymore.


It's always DNS.


Or BGP


Unless it's Nagle. (Sorry animats.)


My money is on expired domain somewhere or security certificate.


A Windows update restarted a critical server automatically. A core service is blocked from starting by a Candy Crush ad installed by the update. The Candy Crush ad is somehow expecting a Copilot key to be pressed on the keyboard to let the system keep going.

MS engineers are waiting for an online purchase of a new $400 keyboard with the Copilot key to complete and planning to run to the data center to plug in the keyboard.

However, the Bing outage is preventing the purchase from going through, because the payment somehow relies on Bing suggestions loading, for obscure reasons.


Very very technically, if RDP is enabled and working, this could be fixed by rdesktop-ing to the machine from a Linux box and using xdotool to experiment with typing raw keyboard scancodes through the RDP session in the hope you figure out the encoding of the Copilot key.


Neat answer :-)

I also appreciate that you headed off my counter by including the guard "if RDP is enabled and working", and that a follow-up answer actually provides the missing piece xD.


The Copilot key is actually Left-Shift + Windows key + F23
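
A minimal sketch of the xdotool side, in Python for convenience; it assumes the RDP client window already has focus and that it forwards the F23 keysym as the right scancode, which is exactly the part you'd have to experiment with:

    import subprocess

    # Send Left-Shift + Super (Windows key) + F23 as one chord to whatever
    # window currently has focus -- ideally the rdesktop/xfreerdp session.
    # "shift+super+F23" is xdotool's spelling of that combination.
    subprocess.run(["xdotool", "key", "--clearmodifiers", "shift+super+F23"], check=True)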


And you can remap it, but you can't use it as a modifier key: https://www.tomshardware.com/software/windows/windows-copilo...


MS knows better than to run Windows on their servers. They've famously been running Linux on their public web-facing stuff for years, including when they were publicly discrediting Linux (because IIS and MS server were so good, they couldn't run their web services reliably and securely).


No way Microsoft could be this inept… could they?


Update: the payment gateway is Stripe, which is not processing any transactions associated with the MS account. A developer has posted the issue to HN in the hope that a Stripe employee will see it and escalate the issue. /s


I love that you had to include the /s because someone might actually believe that is indeed the case. What a bizarre world we live in.


Poe’s Law is pretty old by now


I get that but still. I found it amusing.


Not necessarily massive. Given that Bing works again now, this seems more like an API frontend failure, or an internal routing failure at some level.

Note they seem to have managed to fix the Bing frontend hours ago, but DDG is still dead in the water. Priorities... :-)


Bing does not work for me at the moment (maybe they can service partial traffic? Unclear if it's better or worse than a few hours ago when the outage started)


Genuine question, are distributed systems naturally more resilient?

I can see arguments for both sides: your point on the one hand, and on the other the hidden failure modes you get without central observability and ownership. Nothing exists in isolation.


Not distributed per se, but diversity makes a huge difference in resilience.

When everybody is using the exact same tech, the fallout of an incident can be huge because it will affect everybody everywhere at the same time. Superficially it might seem efficient and smart, but the end result is fragility.

Diversity of species is what nature ended up with as the ultimate solution: the individual species do not matter, but life as a whole will be able to flourish. With technology, we're now moving the other way: every single thing gets concentrated into one of the few cloud providers. Resilience decreases, fragility increases.


I prefer "heterogeneity" over "diversity". Different implementations of similar processes generally make different tradeoffs, incur different bottlenecks, and result in an ecosystem with a higher statistical probability that one relative Black Swan won't wipe out a key structural function in its totality.

It's actually a hallmark of building fault-tolerant systems and ecosystems. Pity the economists and MBAs can't be convinced of it. Otherwise there'd be less push to create TBTF institutions.


Distribution alone doesn't make a system resilient. A distributed system can help with resilience for anything related to network or hardware failure, but even then you need to make sure the different resources don't have a hard dependency on each other.

If you want a resilient system redundancy and automatic failover systems are really important, along with solid error handling.

Think about a distributed data store, for example. You may spread all your data across multiple distributed regions, but if each region manages only a shard of the data and the shards aren't replicated, then you still lose functionality when any one region goes down. If you instead have a complete copy of the data in each region, and a system to automatically switch regions if the primary goes down, your system is much more resilient to outages (though also more complex and expensive).
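
A toy sketch of that second setup (the region endpoints are made up, and in practice the failover would usually live in a load balancer or database driver rather than in application code):

    import urllib.request

    # One complete copy of the data sits behind each endpoint; on failure
    # we fall through to the next region instead of giving up.
    REGIONS = [
        "https://us-east.example.com",
        "https://eu-west.example.com",
        "https://ap-south.example.com",
    ]

    def fetch(path, timeout=2.0):
        last_error = None
        for base in REGIONS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except OSError as exc:  # URLError, HTTPError and timeouts all subclass OSError
                last_error = exc
        raise RuntimeError("all regions unavailable") from last_error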


It does not guarantee resiliency, but it does increase it.

If mastodon.social disappeared tomorrow, the network might lose 80% of its content, but recovery could be possible even if that server never comes back.


My point was just that resilience still depends on how a system is distributed and what else is done.

Distribution alone doesn't really make a difference, though pairing it with redundancy and failovers is going to get pretty far.

The case of mastodon.social is really a question of whether the value there is the network and protocol itself or the user-created content posted there. If it's the user content, the value is lost when the one host goes away. If the value is the network and protocol, then yes, the value of the network is still there even though the data is gone. It does raise an interesting question of whether Mastodon is really distributed or not: the network is, and hosts share a protocol, but the data isn't really distributed.


Yes, there is the question of network vs data :). And as you mention, while some data ends up being distributed with ActivityPub, the protocol is not designed to allow restoration.

One point I find interesting too is that distributed networks often allow more agency to external actors. For example, if you believe that the resiliency of the mastodon.social instance is not enough for you, then you can decide to host your own server with your preferred criteria.


That's really where ActivityPub starts to rub me the wrong way. Server admins really need moderation power since everything is hosted on their hardware, but it also is a poison pill for decentralization.

I can host my own server and make my own rules, but every other admin can just ban my instance.


I feel like that's actually a counterexample. At least, most people with mastodon.social as their home server will probably not have a backup of their followed/following graph and will never be able to recover it.


With a large number of small providers, more often than not some of them will fail on any given day, but the stars need to align really well to get the half-of-the-internet-is-down kind of failure that an AWS or Cloudflare outage causes.
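
Back-of-the-envelope, with made-up numbers and assuming independent failures: 10,000 small providers, each down 0.1% of days, versus everything sitting on one provider with the same failure rate.

    p, n = 0.001, 10_000

    print(n * p)             # expected small providers down on a typical day: ~10
    print(1 - (1 - p) ** n)  # chance at least one of them is down today: ~99.995%
    print(p)                 # chance "half the internet" is down when it all sits
                             # on one big provider with the same failure rate: 0.1%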


Not exactly “more resilient”, but rather, “the only way to gain more resiliency over a single system”.

A distributed system can be more resilient, but it also adds complexity, making it (sometimes) less reliable.

A single system with a lot of internal redundancy can be more reliable than a poorly implemented distributed system, which is why at a smaller scale it’s often better to scale vertically until a single node can’t handle your needs.

Distributed systems are more of a necessity than “the best way”. If we could just build a single node that scaled infinitely, that would be more reliable than a distributed system.


“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” — Leslie Lamport, 1987


Distributed systems with tight coupling and no redundancy are less resilient. It's not so much a question about distribution but more about redundancy and coupling.


> Genuine question, are distributed systems naturally more resilient?

Only if they've prioritized the "availability" component from the CAP theorem.


>are distributed systems naturally more resilient?

All else being equal: Yes.

It's like asking if a RAID1 is more resilient than a single drive.


RAID1 is mirrored. That is not what I would call a typical distributed system. It is a very redundant system. Like a cluster.

A distributed system without redundancy would be more like data striped across disks without parity.

And that actually makes it less resilient, because failure of one component can bring down the whole system, and the likelihood of failure is statistically higher because of the higher number of components.
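
To put rough numbers on that, assuming an illustrative 1% annual failure rate per disk and independent failures:

    p = 0.01

    print(1 - (1 - p) ** 10)  # 10-disk stripe, no parity: ~9.6%/year the whole array is lost
    print(p ** 2)             # 2-disk mirror: ~0.01%/year both copies are lost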


When I think of distributed systems, the RAID1 analogy seems much more applicable than RAID0.

The term "distributed" has traditionally been applied to the original design of the TCP/IP protocol, various application-layer protocols like NNTP, IRC, etc., with the common factor being that each node operates as a standalone unit, but with nodes maintaining connectivity to each other so the whole system approaches a synchronized state -- if one node fails, the others continue to operate, but the overall system might become partitioned, with each segment diverging in its state.

The "RAID0" approach might apply to something like a Kubernetes cluster, where each node of the system is an autonomous unit, but each node performs a slightly different function, so that if any one node fails, the functionality of the overall system is blocked.

The CAP theorem comes to mind: the first approach maintains availability but risks consistency, the second approach maintains consistency but risks availability. But the second approach seems like a variant implementation strategy for what is still effectively a centralized system -- the overall solution still exists only as a single instance -- so I usually think of the first approach when something is described as "distributed".


You're assuming a stateful system where the state is distributed throughout the components of the system. For a stateless component of a distributed system, you don't need redundancy to recover from an outage.

>likelihood of failure is statistically higher because of the higher number of components

Yes, absolutely true, but resiliency for a distributed system is not necessarily like your example of data striped without parity, unless we're specifically talking about distributed storage.


To the GP's point - if you lose the RAID controller, you've lost a whole lot more than a single drive.


The controller isn't stateful; it's just an interface to the disks. If the controller fails, but the disks haven't, then all you've lost is the time it takes to plug the disks into a new controller.

With RAID1, there's also nothing specific to the RAID configuration inherent in the way the data is encoded on the disk. You might have to carefully replicate your configuration to access the filesystem from a failed RAID0 array, but you can just pull an individual disk out of a RAID1 array and use it normally as a standalone disk.


Yes, RAID isn't a backup, but it is resilient.

You will have a better chance at uptime with a RAID than a single drive, so you hopefully don't have to climb up ventilation ducts, walk across broken glass, and kill anyone sent to stop you on your quest to reconnect those cables that were cut.


Used a Co-pilot enabled PC.



