Github is down (status.github.com)
123 points by jayniz on Sept 10, 2012 | 43 comments



I've noticed that these are not immediately marked [dead] by the Hacker News admins like some other spammy topics, and I guess this is a sort of gray area. On the one hand, it is in some sense a "new thing" and it affects the community; on the other hand, it does not particularly gratify anyone's intellectual curiosity to be linked to that status page. The later explanation and diagnosis of what went wrong, if it's made public, might be interesting; just knowing "you can't use this now" isn't of that caliber, though.

It's also puzzlingly dynamic: in other circumstances, the "current status" pages linked during outages have not been blog-type, but have instead just literally reported the current status, leading to links which say "X is down" -- only for you to click one and see "X is functioning normally." This has already happened for the folks who linked the GitHub main page, which loads normally. (And won't it also prevent the same URLs from being submitted at a future date?)


Also, many of these things, by the time some people (such as myself) see them on the front page, are already fixed.


I remember a time when meta-discussion on HN was looked down upon and never voted to the top of the comments section. I realize the irony of my comment.


This feels like a pretty standard pattern for a lot of services — fail, come back up on the backup DB, fail again when the backup proves incapable of handling the surge of load, then eventually come back up on the primary DB once people have gotten bored and stopped hitting 'refresh'.

Is that a function of not prewarming failover DBs, or is there something pathological about the primary-secondary pattern?


Maybe we just don't hear about the situations where the backup/secondary server succeeds in picking up the slack, because it would be transparent to the end user?


Selection bias, good point.


Yes, and the selection bias goes even deeper: The simplest possible failover logic is "if something is wrong with our ability to talk to the database, try the secondary database". But, in that case, you almost never see a broken website running on its primary database. Inevitably, the site has already tried failing over to the secondary before it gives up and yells for help.

On the one hand, this is not a good thing, because if you've got a problem that's unrelated to the database (e.g. too much traffic is choking up your supply of DB connections) and then you do a failover, now you have two problems - or, at least, more moving parts to sort out before the situation is resolved. So it's tempting to design a more clever failover scheme. But, on the other hand, cleverness is itself a risk: Not only might your clever algorithm have an even-more-clever pathological failure mode, but it's harder to understand in an emergency. When your stuff is broken, simplicity is your friend. All else being equal, you don't want your front-line emergency responder to have to understand complex failover logic. There is nobody more frustrated than an ops engineer who can't make the system use the primary database because some stupid bot keeps forcing the use of the secondary, or vice versa. In the heat of battle, they're liable to comment out your clever bot and replace it with a one-line shell script.

Engineering is a difficult balancing act.
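
For what it's worth, the "simplest possible failover" above really is only a few lines. A rough sketch in Python (the connect targets and the DatabaseDown exception are made up for illustration; real code would use whatever your driver actually raises):

    # Bluntest possible failover: any problem talking to the primary
    # sends the request to the secondary, no questions asked.

    class DatabaseDown(Exception):
        pass

    def connect(host):
        # Stand-in for psycopg2.connect / MySQLdb.connect / etc.
        # Pretend the primary is unreachable so the fallback path runs.
        if host == "db-primary":
            raise DatabaseDown(host)
        return "connection to %s" % host

    def get_connection():
        try:
            return connect("db-primary")
        except Exception:
            # The bluntness discussed above: a choked connection pool, bad
            # credentials, or a truly dead box all end up here and quietly
            # shift the whole site onto the secondary.
            return connect("db-secondary")

    if __name__ == "__main__":
        print(get_connection())  # -> connection to db-secondary

Which is exactly why, by the time a human looks at it, the broken site is almost always already running on the secondary.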


Failing over to the secondary only helps if the problem is local to the primary. If you pushed bad code or the system just can't handle the load, the secondary will fail in the same way.


This is why Chaos Monkeys pay dividends.


If only there was some way of using git locally without needing github!

Oh wait, this isn't svn.


Aside from the fact this argument falls apart when you use submodules or third party repos, a lot of us use GitHub for the services they offer beyond just git.


After you have init'ed and updated your submodules, the argument holds fine.


Sure. When GitHub is online, the argument holds fine as well. I fail to see what difference that makes. Clearly the situation being described is when the submodules aren't init'd and GitHub is down (e.g., bringing up a new server and deploying with Capistrano).


If only there were some way to mirror submodules or other third party repos to your own servers without needing github.

Oh wait, this isn't svn.


I think he's also referring to website hosting, issues, wiki, etc.


Sure, there is, but then why do I pay for GitHub? And I was mostly alluding to pull requests, issues, pages, etc. All the non-git stuff that I rely on being available as well.


The outage probably isn't going to last forever. And it's very unlikely you'd need to refresh third party repos in the time github is down.

Now if you're juuuust looking for some library or its README/documentation (I usually use Github for that) ... well I'm sure going for a walk for a few hours won't hurt you :)


Your statement was "If only there was some way of using git locally without needing github". Stick to it and save the theatrics. Git is a software package that you can download and use, yes Virginia, "locally". As for a remote host serving a precious resource not always being available, well Virginia, those are just the facts of networking life.


It's actually quite likely. If you have a git repo listed in a Gemfile, for example, and you need to bring up a new server or deploy, you're going to have problems. Sure, the outage isn't going to last forever. But the standard, snarky reply about git working fine locally is really tiresome.


Instead of a snarky reply, perhaps someone will think of a good solution.

The simplest possible thing: instead of supporting a single git URL, allow specifying several. That way, if anything goes down, the system can fall back to the next git server.

Another nice benefit of doing this could be automatic load balancing.
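
A sketch of what that fallback might look like if you bolted it on yourself today (the mirror URLs are hypothetical, only the plain `git clone` CLI is assumed, and ideally your dependency tool would do this natively rather than you scripting it):

    import subprocess

    # Same repo, several hosts; order expresses preference.
    MIRRORS = [
        "https://github.com/example/example.git",
        "https://git.example-mirror.com/example/example.git",  # hypothetical mirror
    ]

    def clone_with_fallback(dest="example"):
        for url in MIRRORS:
            try:
                subprocess.check_call(["git", "clone", url, dest])
                return url
            except subprocess.CalledProcessError:
                continue  # host down or refusing connections; try the next one
        raise RuntimeError("all mirrors failed")

    if __name__ == "__main__":
        print("cloned from " + clone_with_fallback())

For the load-balancing benefit you'd shuffle the list instead of always walking it in order.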


In my experience most Rails deployments and CI setups depend on Github being up. While it's possible to bundle and copy code on deploy rather than check it out from the server, most projects don't.

So of course you can continue developing locally, but if Github is down, it means the release/CI process is likely broken. I'm not saying this is Github's responsibility, but it's a reality for many people. It's another selling point for Github Enterprise.


My company uses github enterprise, which is a VM that's very easy to make highly available.



With licenses not available for less than 20 seats, it's not really feasible for teams of 10 or less.


I would think that $5k/year would be a small price to pay for backing up your code, especially if your business model relies on coding every single day.


There are also major security reasons to do this -- not relying on github to secure their website/application, which they have already failed to do in the past.


Assembla has an installable repository manager for unlimited SVN and Git projects on your server, and it is free for up to 10 users. http://portfolio.assembla.com/repository_manager.html


Ever tried using gitX? I really like it.


Frontpaging on HN will certainly help.


The purpose of a "status" page is ostensibly so that it can get the most visibility when the service is down, so that people aren't constantly refreshing the main site.


Also the status page is usually hosted on a different server than the service it reports on. Otherwise it would go down with the main service and be pretty useless.


I know. I forgot my /sarcasm tag. ;-)


Now that is a good service status page. It made me go from "Github sucks" to "Go github." It is well put together, informative, and upfront about the whole situation. Their auto-refresher seals the deal. It shows they are confident in their skills to get the problem fixed. Bravo github. Well done. Now hurry up and finish so you can answer the support email I sent this morning. :)


I can see a github post-downtime analysis blog post coming up (I hope). :)


Question for the Github people: Why not keep serving non-stale cached data while your databases are down?

You can do this with proxies or by modifying your code to always serve out of cache, and the db updates the cache, so if the db is down, the cache is your temporary failover while you fail over to the secondary db. ('cache' is anything memcached-like that's separate from your db)
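
To spell out the shape of that (purely illustrative: a plain dict stands in for memcached, and save_to_db is a stub for the real write path):

    # "Always serve out of cache; the db write path updates the cache."
    cache = {}

    def save_to_db(key, value):
        pass  # stand-in for the real write; would raise when the db is down

    def write(key, value):
        save_to_db(key, value)   # the write goes to the database first...
        cache[key] = value       # ...and the same code path refreshes the cache

    def read(key):
        # Reads never touch the database, so a db outage only blocks writes;
        # readers keep getting the last value the database accepted.
        return cache.get(key)

    if __name__ == "__main__":
        write("repo:42:description", "Ruby on Rails")
        print(read("repo:42:description"))  # served from cache, db never consulted

The obvious gap is a cold or evicted cache, at which point a miss puts you right back to needing the database.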


We do in a lot of cases. We just don't have every detail in the whole app cached. :)


You don't have some things cached that should be obvious to cache, like the HTML for the most popular repos. Loading the commit messages from JSON for a page that's being accessed tens of thousands of times a day is less than ideal.

It isn't just github; it seems like a lot of web apps don't apply the lessons learned from plain web sites.


I think there's a culture among modern developers that says using older, less sexy technology isn't going to work as well as newer, sexier technology. "Cache HTML?! That's so inefficient!" Yeah, it also just works. When all your databases, content engines, storage services, deployment tools, etc. take a crap, your clunky little web proxy cache keeps right on humming and your customers get at least a half-functioning site, if they notice at all.


For the stuff you haven't got serving out of cache yet, might I recommend putting a proxy on the frontend layer? It can reverse proxy almost all of your stuff, and cache/proxy the remaining content which isn't dynamic (or even dynamic stuff, given the right voodoo). Cheap and codeless.
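
Not quite codeless if you roll it yourself, but the idea fits on a page. A toy Python version of "serve from upstream when you can, serve the last good copy when you can't" (Varnish/Squid/nginx are what you'd actually deploy; the upstream URL and port here are placeholders):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    import urllib.request, urllib.error

    UPSTREAM = "https://github.com"   # placeholder upstream
    cache = {}                        # path -> (status, content-type, body)

    class CachingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                with urllib.request.urlopen(UPSTREAM + self.path, timeout=5) as resp:
                    body = resp.read()
                    ctype = resp.getheader("Content-Type", "text/html")
                    cache[self.path] = (resp.status, ctype, body)
                    self._send(resp.status, ctype, body)
            except (urllib.error.URLError, OSError):
                if self.path in cache:
                    self._send(*cache[self.path])   # upstream unreachable: serve the stale copy
                else:
                    self._send(503, "text/plain", b"upstream down, nothing cached yet")

        def _send(self, status, ctype, body):
            self.send_response(status)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), CachingProxy).serve_forever()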


If only there was a better git hosting provider. Oh wait, I use bitbucket.org!


Yay!! We made it, frontpage on HN!


Cool, an auto-refreshing status page that will mostly be looked at when the servers (possibly the network) are already stressed.


That page is hosted on separate infrastructure from their production app.



