Ask HN: Plenty of large sites down; Reddit.com, GNU.org, Discord, coincidence?

IloveHN84 · on Aug 12, 2018

Is AWS/Azure/GCE down?

Remember: the cloud is someone else's computer. When it's broken, you cannot do anything.

nik736 · on Aug 12, 2018

So what? I would rather have Google/Amazon employees on the issue than some random DevOps dude.

i_r7al · on Aug 12, 2018

The random dude is your employee, and his/her main job is to get it to work; it’s their first priority. You will never be AWS/GCP/Azure's first priority.

tobylane · on Aug 14, 2018

How could I be their second priority? For any one service and availability zone isn't it all or nothing (or slow).

zzzcpan · on Aug 12, 2018

> So what? I would rather have Google/Amazon employees on the issue than some random DevOps dude.

This is fine if three nines of availability is all you need. Doesn't matter much if you prefer a big brand employee fixing things or a small brand employee. It doesn't change the outcome.

However there are a lot of things that simply cannot live with crappy three nines availability. And the only way to do better is to stop relying on any single cloud, which inevitably requires infrastructure engineers aka random devops dudes.
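For a sense of scale, here is a rough sketch of the downtime budget that "N nines" implies. It assumes a 365-day year and counts all downtime equally; the function name is made up for illustration. Three nines works out to roughly 8.76 hours of downtime per year.

```python
# Downtime budget implied by "N nines" of availability.
# Assumption: a 365-day year, and every minute of downtime counts the same.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of 1 - 10**-nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```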

throwawaymath · on Aug 12, 2018

In fairness one "random DevOps dude" might be equally capable and less expensive for your infrastructure. Generally speaking any software company can succeed without a cloud provider's infrastructure, it's just a matter of cost and developing that competency in-house. There are many site reliability engineers who specialize in high availability and downtime resolution on baremetal hardware. StackExchange notably has this competency internally.

user5994461 · on Aug 12, 2018

Site reliability is a new fancy name for the sucker who is on call.

They will change careers after being forced to work on weekends and holidays a few times. Incidentally, today is a Sunday AND the most taken holiday of the year.

throwawaymath · on Aug 12, 2018

Not all site reliability engineers have a bad work life balance.

user5994461 · on Aug 12, 2018

The ones at AWS and Google, maybe not.

The ones who are the single "random DevOps dude" at a small company trying to emulate AWS and Google do have zero balance.

xd · on Aug 12, 2018

Well I guess the whole essence of HN is now lost... no room for some "random" startup dude to do anything that can be trusted.

verletx64 · on Aug 12, 2018

I think you can concede that if we're talking a five nines foot race, you're unlikely to beat out AWS etc.

zzzcpan · on Aug 12, 2018

Five nines means you can't rely on a single infrastructure provider for anything ever and it doesn't actually matter whether you use AWS at all.
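A back-of-the-envelope sketch of why redundancy across providers changes the math. It assumes provider failures are fully independent, which real outages (shared DNS, shared upstreams, correlated attacks) often violate, so treat the result as an upper bound. The function names are illustrative, not from any library.

```python
import math


def parallel_availability(availabilities):
    """Availability when the service is up if ANY one provider is up.

    Assumption: provider failures are statistically independent.
    """
    p_all_down = math.prod(1 - a for a in availabilities)
    return 1 - p_all_down


def serial_availability(availabilities):
    """Availability when the service needs EVERY dependency to be up."""
    return math.prod(availabilities)


# Two independent three-nines providers, either of which can serve traffic.
print(parallel_availability([0.999, 0.999]))
# One three-nines provider plus one three-nines dependency, both required.
print(serial_availability([0.999, 0.999]))
```

Under the independence assumption, two three-nines providers in parallel give about 0.999999, while chaining two required three-nines dependencies drops below three nines.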

apple4ever · on Aug 12, 2018

I wouldn’t. Hire the right person and you have immediate response instead of waiting for somebody else. A large reason we are not going cloud for our new infrastructure.

nik736 · on Aug 12, 2018

And this one person never sleeps and is 24/7 on-call, right?

geofft · on Aug 12, 2018

and isn't about to quit once they have a little more money saved up?

hluska · on Aug 12, 2018

Beware of the bus factor! Going in house is great, as long as everything is well documented and your company has good backup resources.

user5994461 · on Aug 12, 2018

It's Sunday. Please wait for Monday office hours for your immediate response.

zaarn · on Aug 12, 2018

At work we have a 24/7 20min response time clause. If the work emergency phone rings, we are ready to help within 20 minutes at any time around the clock, even on Sunday.

Why would you do anything else for your sysop/sysadmin?

user5994461 · on Aug 12, 2018

You surely realize that no human being can be available 24/7 within 20 minutes. It's beyond slavery to expect that from any employee.

You need at least 10 sysops/sysadmins to achieve anything close to that SLA, with a sustainable rota. Contrary to the parent posters who believe it can be done with THE right guy.

greenleafjacob · on Aug 13, 2018

With 3 people you can have a "follow the sun" rotation during business hours which takes care of the entire week, and I don't think you would need 7 more people for the weekend.
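A minimal sketch of what a three-person "follow the sun" rotation could look like, assuming three engineers in time zones roughly eight hours apart who each cover one 8-hour block of the UTC day. The region labels and shift boundaries are hypothetical.

```python
# Hypothetical follow-the-sun shifts: (start_hour_utc, end_hour_utc, who).
# Assumption: three people, each covering 8 hours that fall in their daytime.
SHIFTS = [
    (0, 8, "apac"),
    (8, 16, "emea"),
    (16, 24, "amer"),
]


def on_call(hour_utc: int) -> str:
    """Return who holds the pager at the given UTC hour (0-23)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be in 0..23")
    for start, end, who in SHIFTS:
        if start <= hour_utc < end:
            return who
    raise AssertionError("shifts do not cover the full day")


print(on_call(3), on_call(12), on_call(22))
```

This only covers who answers at a given hour; vacations, sick days, and weekend handover would still need a backup arrangement.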

zaarn · on Aug 13, 2018

We manage it with 3 sysop/sysadmins. I think you vastly overestimate how many people you need to keep a 20 minute SLA.

Spooky23 · on Aug 14, 2018

That’s a shitty life.

zaarn · on Aug 14, 2018

I find it rather insulting that you think it must be a shitty job.

Spooky23 · on Aug 14, 2018

Expectation of 24/7, 20 minute response to a callout is bullshit. If that’s not shitty, you’re a victim of some sort of employee Stockholm syndrome.

zaarn · on Aug 14, 2018

Not all 3 have to respond all the time. It's a rotating schedule for having the overnight phone and the weekend phone (during weekend nights the SLA is relaxed to 1 hour).

And keep in mind, 20 minute response doesn't mean you fix the problem in 20 minutes, it means you respond in 20 minutes to the callout.

I think you're a victim of an easy-going startup culture.

Spooky23 · on Aug 14, 2018

I misread it.

I’ve been oncall for escalations for like 15 years. That’s miserable enough; IMO frontline guys need fixed schedules and rotation if the volume is high.

I definitely cast my perspective on this and apologize if I came on too strong.

user5994461 · on Aug 14, 2018

Sorry to say, you're right and he has Stockholm syndrome. It's a very aggressive schedule, must be missing holidays half of the time to keep the phone.

simonjgreen · on Aug 14, 2018

Why would you hire just one person?

zacharycohn · on Aug 12, 2018

person

jrs95 · on Aug 12, 2018

This just doesn’t make sense. Google/Amazon employees basically are some random DevOps “dudes”. Whereas your own people would be...whoever you decided to hire to work on your infrastructure.

leot · on Aug 12, 2018

Infrastructure (ops, software, hiring, social) matters.

Hiring some dude won't give you that.

devwastaken · on Aug 12, 2018

Problem being that infrastructure is made to be very, very complicated because they're selling tons of managed features that rely on each other. I don't know why this idea of having your own devops is suddenly now bad. Renting your own bare hardware and managing solutions yourself is still very much a thing, and something you have to do if you're using lots of bandwidth. Dropbox for example made their own infrastructure to get off of AWS [1].

[1] https://techcrunch.com/2017/09/15/why-dropbox-decided-to-dro...

hluska · on Aug 12, 2018

I don't think that anyone thinks having your own infrastructure is bad, or if they do, they likely don't have much experience. Rather, I think it can be a nightmare if it's not properly managed, and it's hard to develop the skill to properly manage the process unless you've been burned in the past.

module0000 · on Aug 12, 2018

Hiring the right dude will. The trouble is, most of them are happily employed elsewhere.

allannienhuis · on Aug 12, 2018

at AWS/Google? :P

user5994461 · on Aug 12, 2018

The right dude is both on weekend AND on holiday right now.

Please wait until Monday 20th to get a status on the issue. Thank you.

lameiam · on Aug 12, 2018

yeah, by google and amazon.

bootloop · on Aug 12, 2018

It's like insurance. You wouldn't be able to hire as many and as highly paid random DevOps on your own. So you and others pay a third party to do it, just in case you need a competent person.

dullgiulio · on Aug 12, 2018

Would be big news to discover that GNU.org runs on some cloud hosting ...

solarkraft · on Aug 12, 2018

I'd rather expect it to be a problem with some other provider. Proxy, or something.

ajeet_dhaliwal · on Aug 13, 2018

I’d need to see some spy shots of Stallman in an Uber, talking on an iPhone and tapping out a denial on a Surface to really believe this news was true.

ekr · on Aug 12, 2018

GNU.org relying on a cloud provider? Isn't that going against the GNU philosophy a bit?

pigscantfly · on Aug 12, 2018

Speculating here, but I was woken a half-hour back by an SMS from our prod monitoring system. The people at Azure had required maintenance for some instances scheduled for this morning, which I had had performed during the scheduled window over the last two weeks, but they seem to have brought down two thirds of the instances anyways. Possibly unrelated; just my two cents.

ReverseCold · on Aug 12, 2018

Last time this happened for me it was Cloudflare's fault. Google and some other really large sites worked, but not much else.

throwawaymath · on Aug 12, 2018

That's pretty vacuous. Everyone's computer is someone's computer. The more important point is how capable you are at managing it yourself.

What you're trying to get at is this: would you rather trust your infrastructure to a large organization whose core competency it is to do so, or would you rather manage it yourself? For many companies it makes more sense to have someone else manage it because of division of labor.

If you believe you're better suited to managing your own hardware for cost or capability reasons, you should. But of the arguments in favor of that decision, pointing out that "you cannot do anything" when GCP/AWS/Azure has downtime is a pretty poor one. It's an exceptional circumstance if you're 1) able to achieve better uptime than a cloud provider, 2) at nearly the same cost (in personnel, hardware and software), and 3) while being relatively unaffected by the downtime of major cloud providers anyway.

The companies for which the calculus shifts in favor of managing their own hardware probably don't need to be told "the cloud is just someone else's computer." In contrast, most companies using a cloud provider do not have a readily available alternative because they do not have in-house talent capable of maintaining baremetal hardware (local or colocated).

I consider myself personally capable of maintaining a baremetal distributed system with high availability, because I presently do that. But for the most part I wouldn't encourage companies using a cloud provider to invest in their own infrastructure. It's usually expensive in personnel, time or both.

asdojasdosadsa · on Aug 12, 2018

I'm not the best at interpreting this map[0] but it seems that something is going on?

[0] http://www.digitalattackmap.com/#anim=1&color=0&country=ALL&...

user5994461 · on Aug 12, 2018

Looks like an ongoing DNS reflection attack from all over the world to the US.

sonofblah · on Aug 12, 2018

What's the significance of Poland in the output? A tracking thing?

I noticed problems with Reddit earlier, too.

wildrhythms · on Aug 12, 2018

Where are you seeing 'Poland in the output'?

>Edit: Nevermind, I see what you mean (on the map). I'd be interested to know too... maybe PL is a big player in their attack monitoring?

seba_dos1 · on Aug 13, 2018

I don't think so - when you look at historical data, labels change. Seems like Poland is simply a significant actor in this particular attack.

mrdrozdov · on Aug 12, 2018

Looks like wunderground.com is down too. If you're wondering, high chance of thunderstorms this evening in New York, NY.

slavojastoria · on Aug 12, 2018

I was wondering, thanks. Rain has been wild recently

CPUstring · on Aug 12, 2018

Whatever is happening, it got me out of bed instead of browsing endlessly

JonathanBouman · on Aug 12, 2018

https://status.discordapp.com states that Discord identified and resolved the problem.

noobermin · on Aug 12, 2018

I suppose I came late because only gnu.org is down of those mentioned.

gjvc · on Aug 12, 2018

defcon week

fibers · on Aug 12, 2018

But why gnu? They seem like a static site that can only be taken down by simple ddos?

DonHopkins · on Aug 12, 2018

The Emacs web server had to garbage collect.

https://www.emacswiki.org/emacs/HttpServer

stephengillie · on Aug 12, 2018

Everyone has to start somewhere - maybe some script kiddies are at their first Defcon and saw an easy target?

aviau · on Aug 12, 2018

Why does the fact that the site is static make it easier to take down by a simple ddos?

I have a static website at https://alexandreviau.net/. It sits behind AWS CloudFront. Good luck taking it down.

williamdclt · on Aug 12, 2018

That's not what the parent said, they just said that the only thing that can take down GNU.org is DDoS

rbanffy · on Aug 12, 2018

It's not simpler - quite the contrary. Dynamic sites can be taken down by a number of different attacks. Static sites are a much harder target.

wild_preference · on Aug 12, 2018

It will DDoS your wallet with CloudFront charges which is even worse.

They just need to hit you over a longer time scale and avoid making obvious peaks so that you can’t ask for the DDoS refund.

lawnchair_larry · on Aug 12, 2018

lol, no

digi_owl · on Aug 12, 2018

That, and various incidents on twitter etc over the years makes me really question the professionalism of _sec...

DmenshunlAnlsis · on Aug 12, 2018

Reddit is up as of this writing, although GNU.org is down.

JonathanBouman · on Aug 12, 2018

Pretty unstable here, it loads but all the user specific pages return 'error code: 503'

digi_owl · on Aug 12, 2018

And their status page shows all green...

https://reddit.statuspage.io/

nolok · on Aug 12, 2018

Although I like the concept, at this point status pages are very disappointing to me. There are those that stay green when everything is failing because they're not updated properly, and those that stay green because "it was a localized partial failure only" even though the whole thing breaks (hi AWS!). Sure, some are reliable, but enough aren't that it feels like you can't trust them.

You can't look at the status page and believe what it says, so you go and ask people anyway (on irc, reddit, hnews, whatever community you like). Meaning that page might as well not have existed.

RileyJames · on Aug 12, 2018

Couldn’t agree more. I pushed for one to be implemented at my last job (api) as I felt it was ridiculous that we didn’t have a means to communicate downtime, outages, issues.

Initially the status page worked. But as more and more people subscribed to it, issuing an alert became a bigger deal.

And unfortunately an issue couldn’t be raised only to those it was relevant for.

All this led to was not updating the status page, and thus it became a useless tool for determining whether an issue was occurring.

Back to Twitter...

I feel the product needs a lot of work in practice, and possibly in implementation and training.

nolok · on Aug 12, 2018

Ah, sadly I believe your personal experience is very common.

It's insane really; a company puts out a status page to say to their customers "you can trust and rely on us through that dedicated medium to know our status", and if the customers in question buy into the proposition and use it, the very first thing that company does is make it so you cannot trust and rely on them through that dedicated medium. Succeeding is what causes it to ultimately fail.

Status pages should have stayed an undocumented feature for "the little guys" behind the scenes to communicate, and never gotten into the open world where PR and marketing and decision makers can roam.

zaarn · on Aug 13, 2018

Status pages shouldn't need any manual intervention.

I set up mine to automatically monitor my website from another service provider in a different datacenter. That way I know if the server is down for any reason and it updates automatically.

If my server goes down, within 5 minutes the status page is red. End of story.

If it's back up, the status page goes green again.

Manual status pages are a mistake.
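A rough sketch of the kind of automated check described above, not the poster's actual setup. It assumes an external machine probes the site on a schedule (say once a minute from cron) and flips the page to red after a few consecutive failures; the threshold, the function names, and the URL in the usage note are all hypothetical.

```python
import urllib.error
import urllib.request
from typing import Sequence


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx, DNS failures, refused connections, timeouts.
        return False


def status_from_probes(results: Sequence[bool], failures_for_red: int = 3) -> str:
    """Derive the page colour from probe history, oldest first.

    Red only after `failures_for_red` consecutive failed probes; a single
    successful probe turns the page green again.
    """
    recent = list(results[-failures_for_red:])
    if len(recent) == failures_for_red and not any(recent):
        return "red"
    return "green"
```

Usage would be something like appending `probe("https://example.com/")` to a stored history each minute and publishing `status_from_probes(history)`. With a one-minute interval and a threshold of three, the page turns red well within the five minutes mentioned above.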

mrmattyboy · on Aug 12, 2018

It shows a big spike in 500s and dip in request rate around 2 hours ago

zabana · on Aug 12, 2018

Up in Europe (Paris) at 5:08 local time.

a012 · on Aug 12, 2018

I use the Reddit is fun app and browsing is fine.

thiscatis · on Aug 12, 2018

Turkey?

batuhanicoz · on Aug 12, 2018

I don't understand the relation/reference.

I'm Turkish and have been watching the news but I don't see any reason why someone correlates large websites being down with Turkey. With no explanation too.

Can you elaborate please? This is an honest question and I would like to know if my government is hacking foreign sites in retaliation for sanctions.

thiscatis · on Aug 12, 2018

Well, Trump obliterated any chance of a decent lira value for the next few years with his tweets and sanctions.

terminalcommand · on Aug 12, 2018

Why would you think that Turkey would retaliate against Trump's tweets by taking down reddit and GNU.org? I don't think that the Turkish government has nearly enough technical knowledge to pull off something like that. That is the problem with Turkey right now, it seems to me that the government doesn't want to work with qualified people.

batuhanicoz · on Aug 12, 2018

Honestly, I wouldn't dismiss the technical capabilities of the Turkish government.

duxup · on Aug 12, 2018

Turkey's government was on that way already, Trump just gave it an extra push.

oneplane · on Aug 12, 2018

Retaliation would make sense, but I haven't dug deep into Turkey's APT crews lately. Most of the stuff I hear/read about is talking about Iranian, Russian, Chinese and roaming APT groups doing attacks. It also would make attacks from Turkish AS's more logical as the government would not likely do something about 'their own' for free.

based2 · on Aug 12, 2018

https://kb.isc.org/article/AA-01639/74/CVE-2018-5740%3A-A-fl...

Twirrim · on Aug 12, 2018

What evidence do you have to support this claim?

I could sit here and just pull out random CVEs too, with as much validity.