I get the angry unicorn page "No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists. Contact Support — GitHub Status — @githubstatus" with that last link going to https://x.com/githubstatus showing "GitHub Status Oct 22, 2018 Everything operating normally."
Making logins required to view twitter was the ultimate bed shitting move. The whole point of twitter was to be a broadcast medium. Tweets were viewable without following or logging in. There is a huge vacuum in that space now.
For most (social media) platforms, really. Management believes it will force users to sign up, but in reality the platform just becomes less relevant because of that limitation. Not to mention search crawlers.
An all around stupid decision. That said, if management is that shitty, the platform probably won't be attractive for long anyway.
Facebook/Instagram were successful despite that to a degree, but this decision probably still did a lot of damage to their relevancy and user numbers.
I don't really agree with that. Facebook was originally about mirroring your real-life social network. In the mid 00's nobody was trying to get likes from strangers on Facebook.
Instagram is closer to broadcast, but it was always closely tied to the mobile app experience and the "follower" mentality. People didn't really share links to Instagram posts in other online venues in the beginning.
Twitter was always unique. It existed before smartphones, and there was a good chunk of years when people without smartphones would read twitter posts on desktops. Its producer/consumer distribution is much more skewed; many twitter users never post. Tweets were always getting posted to places like HN and reddit, discussed in news articles, etc.
I think Twitter's (former) position as a broadcast medium à la TV, radio, and newspapers is unique among social networks. There's a reason why Twitter was the place for journalists, politicians, academics, fire departments, web service status alerts, etc.
> Facebook/Instagram were successful despite that to a degree, but this decision probably still did a lot of damage to their relevancy and user numbers.
FB/IG/Whatsapp have half of humanity logging into their services once per month, so I'm not sure how much better they could be doing if they didn't have a login wall.
Meanwhile, Twitter (with no login wall) never broke 500mn. Like, personally I totally take your point about status updates but I'd have used my Twitter account a lot more if I'd needed to log in to see the content.
Used to work ops at AWS. I don't know if it's still the case but it required VERY HIGH management approval to actually flip any lights on their "status page" (likely it was referenced in some way for SLAs and refunding customers).
That is an excellent illustration of Goodhart's law. We're going to have this awesome status page, but since updating it would let clients notice the system is down, we're going to put a lot of barriers in the way of putting the actual status on that page.
Also probably a class action suit lurking somewhere in there eventually.
It's because of the way most companies build their status dashboards. There are usually at least two dashboards: one internal and one external. The internal dashboard is the actual monitoring dashboard, hooked up to the other monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally will the external dashboard be updated, to avoid flaky monitors and alerts. It also affects SLAs, so changing the status needs multiple levels of approval; that's why there are some delays.
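In shell-script terms the gating looks roughly like this; a minimal sketch, assuming a hypothetical internal health endpoint, a hypothetical external status API, and a file a human touches to approve the change (none of these are real GitHub endpoints):

```
#!/bin/sh
# Hypothetical sketch: only flip the public status once the incident is
# confirmed internally and a human has signed off on the update.
INTERNAL="https://monitoring.internal.example.com/health"      # assumed internal endpoint
EXTERNAL="https://status.example.com/api/components/git-ops"   # assumed external status API

state=$(curl -fsS "$INTERNAL" | jq -r '.state')                # e.g. "ok" or "degraded"
approved=$(cat /var/run/incident/approved 2>/dev/null || echo "no")

if [ "$state" != "ok" ] && [ "$approved" = "yes" ]; then
  # only after sign-off does the customer-facing page change
  curl -fsS -X PATCH "$EXTERNAL" -d '{"status":"degraded_performance"}'
fi
```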
> The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally will the external dashboard be updated, to avoid flaky monitors and alerts. It also affects SLAs, so changing the status needs multiple levels of approval; that's why there are some delays.
This defeats the purpose of a status dashboard and makes it effectively useless in practice most of the time from a consumer's point of view.
From a business perspective, I think given the choice to lie a little bit or be brutally honest with your customers, lying a bit is almost always the correct choice.
My ideal would be regulations requiring downtime metrics to be reported as a "suspected reliability issue" with at most a 10 to 30 minute delay.
If your reliability metrics have lots of false positives, that's on you and you'll have to write down some reason why those false positives exist every time.
Then that company could decide for itself whether to update manually with "not a reliability issue because X".
This lets consumers avoid being gaslit, and businesses don't technically have to call it downtime.
This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also coordinating incident response during a huge outage is always challenging.
FWIW, our self-hosted Gitea instance has not had a single second of unplanned downtime in the five years we've been running it. And there wasn't much _planned_ downtime either, because it's really easy to upgrade (pull a new image and recreate the container, which takes out the instance for maybe 15 seconds late at night), and full backups are handled live thanks to zfs.
Migration to a new host takes another 15 seconds thanks to both zfs and containers.
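For anyone curious, the routine is roughly the following; a sketch assuming a docker compose setup and a zfs dataset named tank/gitea (all names are illustrative):

```
# upgrade: pull the new image and recreate the container (~15 s of downtime)
docker compose pull gitea
docker compose up -d gitea

# live backup: snapshot the dataset holding Gitea's data while it runs
zfs snapshot tank/gitea@nightly-$(date +%F)

# migration: send a snapshot to the new host, then start the container there
zfs snapshot tank/gitea@migrate
zfs send tank/gitea@migrate | ssh newhost zfs recv tank/gitea
```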
I don't know how many GitHub downtime reports I've seen during that time; we're probably into the high dozens by now.
I've been running Gitea on my homelab for a few months now. It's fantastic. It's like a snapshot of a point in time when GitHub was actually good, before it got enshittified by all of the social and AI nonsense.
I've been moving most of my projects off of GitHub and into Gitea, and will continue to do so.
We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
To be fair - I really couldn't care less whether the homepage is loading or not.
So long as I can fetch/commit to my repos, pretty much everything else is of secondary, tertiary, or no real importance to me.
(At work, I do indeed have systems running that monitor 200 statuses from client project homepages, almost all of which show better than 99.999% uptimes. And are practically useless. Most of them also monitor "canary" API requests, which I strive to keep at 99.99% but don't always manage to keep above 99.9% - which is the very best and most expensive SLA we'll commit to.)
Where on the continent? GitHub is undoubtedly doing blackbox testing internally and has multiple such monitors, but that's not going to capture every customer's route to them, leading to the same problem - customers experience GitHub being down despite monitoring saying it's mostly up. Thus the impasse. Even doing whitebox testing, where you know the internals and can thus place sensors intelligently, even just for ingress, you're still at the mercy of the Internet.
If a sensor that's basically in the same datacenter says you're up, but the route into the datacenter is down, then what? Multiply this by the complexity of the whole site, and monitoring it all with 100% fidelity is impossible. Not that it's not worth trying; there's a team at GitHub that works on monitoring. But beyond the motivation of keeping the SLA up, as a customer, unless you notice it's down, is it really down? In a globally distributed system, downtime, except for catastrophic downtime like this, is hard to define on a whole-site basis for all customers.
I don't think anybody asked for 100% fidelity. We are talking about a complete outage that affected at least North America and Europe. If the status page shows green in such a case, its fidelity is around 50%. People expect better from GitHub.
The amount of moaning that the status page wasn't updated in 0 seconds and had the wrong status for entire minutes is what leads me to believe that no, users do expect 100% fidelity.
Total outages are rare enough, and there's enough other work, that spending time building a system for that just doesn't seem like the best use of their time. Though I'm biased, having faced that exact question from the inside at a different company.
At the moment, all github services seem to be restored, yet the github status page indicates that the problem is still ongoing. I don't think it's related to the SLA, but rather to the monitoring, which is not live; there's a few minutes of delay.
I don't think they were asking for corporate speak. But at least I would find a plain technical error message like "cannot contact file server" much more respectable than something like "unicorns are hugging our servers uwu".
This “ironic” and “humorous” style of errors and UI captions is the actual new corporate speak. I’d prefer dumb error messages rather than some shit someone over the ocean thinks is smart and humorous. And it’s not funny at all when it’s a global outage impacting my business and my $$$.
It's closer to the truth than you usually get. They're having a bad day, and that's completely true. It's the start of my day, but I guess this is the middle of the night for them. There's no such thing as unicorns, but that just highlights the metaphorical nature of the remaining claim - getting the unicorns under control means solving their problems. Normally "professional" corporate speak means avoiding saying anything whose meaning is plain on its face and disconfirmable, while avoiding the implication that the company is run and operated by humans. This message, by contrast, is a model. (Obviously they came up with the message in advance, which just goes to show that someone in the company is well enough rounded to know that if it is displayed, they're having a bad day.)
Looks like we have a full house outage at GitHub with everything down. Much worse than the so-called Twitter / X recent speed-bump that was screeched at and quickly forgotten.
I don't think GitHub has recovered from the monthly incidents that keep occurring. Quite frankly, the expectation that something at GitHub will go down every month shows how unreliable the service is, and this has been happening for years.
I guess this 4-year-old prediction post about self-hosting and not going all in on GitHub really aged well after all [0]
The timing is pretty uncanny. I just deployed a github page and had a DNS issue because I configured it wrong. I hit "check again" and github went down.
Perhaps this is a repeat of the Fastly incident with a customer's Varnish cache configuration causing an issue in their systems (I think this is a rough summary, I don't remember the details).
So, you're both responsible and not responsible at the same time :)
> Hope I don't appear in the incident report.
Appearing in an incident report with your HN username could be pretty funny...
I had a github page that was public, but it was made private and the DNS config was removed. Fast forward to today. I made the private repo public again and forced a deploy of the page without making a new commit. It said the DNS config was incomplete, so I tweaked it and hit "check again" and github went down.
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
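It really isn't; a canary that does what a user does (a real `git ls-remote` against a public repo) plus a post to whatever backs the status page would have caught this one. The status API endpoint below is hypothetical:

```
#!/bin/sh
# end-to-end canary: exercise the same path a user does, from outside
if git ls-remote https://github.com/octocat/Hello-World.git HEAD >/dev/null 2>&1; then
  state="operational"
else
  state="major_outage"
fi

# push the result to the status backend (hypothetical endpoint)
curl -fsS -X POST "https://status.example.com/api/components/git-operations" \
     -d "state=$state"
```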
From my experience this requires a few steps to happen first:
- an incident is declared internally at GitHub
- the support / incident team submits a new status page entry (with details on the service(s) impacted)
- incident is worked on internally
- incident fixed
- page updated
- retro posted
Even aws now seems to have some automation for their various services per region. But it doesn't automatically show issues, because the problem could be at the customer level, or hit only a subset of customers: those in region foo in AZ bar, on service version zed vs zed - 1. So they chose not to display issues for subsets.
I do agree it would be nice to have logins for the status page and then get detailed metrics based on customerid or userid. Someone start a company to compete with statuspage.
Once in the past I did actually have an incident where the site went down so hard that the tool that we used to update the status page didn't work. We did move it to a totally external and independent service after that. The first service we used was more flaky than our actual site was, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)
They say you shouldn't host status pages on the same infrastructure they're monitoring, but in a way that makes them much more accurate and responsive during outages!
Most status page products integrate with monitoring tools like Datadog [1]; large teams like GitHub would have it automated.
You ideally do not want to be making a decision on whether to update a status page during the first few minutes of an incident; if there is a manual process, bean counters inevitably tend to get involved to delay or not declare downtime.
It is more likely the threshold is kept a bit higher than a couple of minutes to reduce false positive rates, not because of manual updates.
Nah, _most_ status pages are hand updated to avoid false positives, and to avoid alerting customers when they otherwise would not have noticed. Very, very few organizations go out of their way to _tell_ customers they failed to meet their SLA proactively. GitHub's SLA remedy clause even stipulates that the customer is responsible for tracking availability, which GitHub will then work to confirm.
It's 00:16, just about to go to bed, I ran `git push` and it's not working. Check Github, says it's down, I think it's only me, maybe I'm blocked, Github can't be down. Come here to check and it's down for everyone, such a relief.
Anybody who publishes an app on the Google Play store and hosts their privacy policy on Github pages may have their app taken down because Google's bots won't be able to verify it exists.
That happened to me a while back with an app listing that was almost 10 years old because the server I was hosting the policy on went down. Ironically, I switched it to Github pages so it wouldn't happen again.
I see more and more people using GitHub less and turning to other git solutions.
I'm afraid to think what to do when GitHub is down for hours (do I need to learn mailing lists?).
Another reason is that MS may be entering a phase where it will ask you to pay just to read from GitHub (rate limiting).
I recently looked into using Git in a decentralized way. It's actually pretty easy!
When you would usually create a PR, you use `git format-patch` to create a patch file and send that to whoever is going to merge it.
They create a branch and use `git am` to apply the patch to it, review the changes, and merge it to main.
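Concretely, the round trip looks something like this (branch and directory names are just placeholders):

```
# contributor: turn the commits on your branch into mailable patch files
git format-patch origin/main..my-feature -o patches/

# ...send patches/ by email, chat, or sneakernet...

# maintainer: apply them onto a review branch, then merge as usual
git checkout -b review-my-feature main
git am patches/*.patch
git checkout main
git merge --no-ff review-my-feature
```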
It is nice that git supports multiple remotes, though. It feels good to know that `git push` might not work for my project right now, but I know `git push srht` will get the code off of my laptop.
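Adding that second remote is a one-time, two-line affair (the sr.ht URL is just an example):

```
git remote add srht git@git.sr.ht:~you/project
git push srht main   # still works while origin (GitHub) is down
```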
> I recently looked into using Git in a decentralized way. It's actually pretty easy!
Well, that's how it was designed to work! The whole point of Git is that it's a distributed version control system, and doesn't need to rely on a centralized source of truth.
I used to work at a company with very draconian policies. Whenever I needed to update some code on a public GitHub repository, I would just push to a remote that was a flash drive. Plug it into my machine at home, pull from that remote, push to origin.
I also had to setup a bidirectional mirror back when bandwidth to some countries was restricted. We would push and pull as normal, and a job would keep our mainline in sync.
It is sad that most organizations forget that git is distributed by nature. We often get requests to set up VPNs and all sorts of craziness, when a simple push to a bare mirror would suffice. You don't even need anything running, other than SSH.
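For reference, the whole "mirror" is just a bare repo and an extra remote; nothing has to run on the other end beyond sshd (paths here are illustrative):

```
# one-time: create a bare repo on the flash drive (or any SSH-reachable box)
git init --bare /mnt/usb/project.git

# at work: add it as a remote and push
git remote add usb /mnt/usb/project.git
git push usb main

# at home (same remote added there): pull from the drive, push to the real origin
git pull usb main
git push origin main
```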
The real reason not to use github anyway though is that it's terrible (the basic "github model" for doing code review was basically made up on the back of a napkin IMO)
Services that explicitly needed the API were also down, and it wasn't pretty. For example: Minecraft Mod packs that rely on SerializationIsBad all went kerplunk! I'm sure a lot of people were scratching their heads yesterday wondering why they couldn't do anything for a time.
What made me giggle, though, were the "X is functioning normally" messages immediately followed by "X is degraded, continuing to monitor", then right back to "normal" again, all in the same 30-second timespan.
Unfortunately, outages happen... This situation is a very good reminder of why having backups and a solid Disaster Recovery plan is crucial. Of course, it’s easy to assume that cloud services are always up, but we should never forget about outages. Setting up automated backups for repos and metadata can save a ton of headaches when things go wrong. Plus, having a Disaster Recovery plan means you’re not stuck waiting for the service to come back online—you can keep working with minimal disruption.
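For the git data itself, a scheduled mirror clone plus a bundle already covers a lot; a sketch with placeholder URLs and paths (issues, PRs and other metadata need the API or a dedicated tool on top):

```
# nightly cron job: keep a full mirror and a single-file bundle offsite
git clone --mirror git@github.com:acme/app.git /backups/app.git 2>/dev/null \
  || git -C /backups/app.git remote update --prune
git -C /backups/app.git bundle create "/backups/app-$(date +%F).bundle" --all
```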
Things seem to be ack'ed:
```
Investigating - We are investigating reports of degraded availability for Actions, Pages and Pull Requests
Aug 14, 2024 - 23:11 UTC
```
This was not the first or last GitHub outage, but unfortunately it was a big one. That's why it is now even more important to have backups of your work and restore capabilities in case of a scenario like this outage. This article sheds light on the importance of backups along with the best practices to follow:
They're pulled from our CDN by default. Only if you use experimental flakes is GitHub in the loop. And even if GitHub isn't down, you can't pull nixpkgs more than twice per hour without running into rate limits and getting your IP banned. Don't rely on GitHub for critical infrastructure.
Update - Issues is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:19 UTC
Update - Git Operations is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:19 UTC
Update - Packages is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:18 UTC
Update - Copilot is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:13 UTC
Update - Pages is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:12 UTC
HN has a strange philosophy built into its ranking algorithm that an item with a large number of comments early on should be de-ranked because the conversation is likely to be of poor quality.
And so go all your packages, private repositories, pages, your AI intern Copilot bot, and GitHub Actions; and soon your AI models, once you host them there - all unavailable and going down with GitHub.
Time to consider self-hosting like the old days instead of this weekly chaos at GitHub.
Who's the Bozo Doofus maintainer? https://yhbt.net/unicorn/LATEST. I love that we can still see Unicorn in action. I rarely had problems with it back in the day.
Cause seems to be database related per most recent update (23:29 UTC):
> We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
Aug 14, 2024 - 23:29 UTC
Latest update at 23:29 UTC says: "We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back."
Aug 14, 2024 - 23:29 UTC
Update - We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
hope it is back up soon
There goes Pages, there goes the CDN for release artifacts, there goes any package manager hosting repositories on GitHub. Is this outage just contained to github or is it an Azure outage?
Yep, angry unicorn.
If the copilot debacle wasn't reason enough to make people migrate or diversify their code repo efforts with, let's say, GitLab, this should be.
This reminds me that for some reason I am logged into my gaming machine's windows store with my GitHub account thanks to the bizarre way that microsoft do auth.
And has a website so anyone could just ask me if something went wrong on github's side and I can send them a complete copy. Decentralised version control is nice!
Update - We are investigating reports of issues with GitHub.com and GitHub API. We will continue to keep users updated on progress towards mitigation.
Aug 14, 2024 - 23:16 UTC
EDIT: The reply link is no longer available.
Update - Packages is experiencing degraded availability. We are continuing to investigate.
Aug 14, 2024 - 23:18 UTC
I love how the same people who try to drag me towards using Git are the only people who seem to have serious problems working on their code when a website goes down.
They probably have a reverse proxy in front of all their http endpoints and that is still up and able to show the unicorn if the backends aren't responsive.
The static content on the error page might also be on the Akamai or Cloudflare side.
Unicorn has a slightly different architecture.
Instead of the nginx => haproxy => mongrel cluster setup
you end up with something like: nginx => shared socket => unicorn worker pools
When the Unicorn master starts, it loads our app into memory. As soon as it’s ready to serve requests it forks 16 workers. Those workers then select() on the socket, only serving requests they’re capable of handling. In this way the kernel handles the load balancing for us.
Love that HN is a better status page for dev services than most companies can manage to provide. Knew I'd find it here but on the front page within 3 minutes is impressive.
Reminds me of a repository I once found when searching for Prometheus exporters.
It did this but with Twitter: it monitors the latest tweets for a custom word combo and raises a server alert when one is found. I found it hilarious. Will post the source once GitHub is back up.
I guess GitHub going down is somehow strangely tolerated, for years even after the acquisition, and it goes down more often than Twitter. When the latter hits a speed-bump, just like the 'interview' with Trump, it's global news because a Mr Elon Musk owns it.
Both seem to be doing too much all at once. But really it is worse with GitHub if this is what Microsoft stewardship looks like: incidents every single week, and guaranteed every month, for years.
What makes you say it’s “somehow strangely tolerated” when GitHub goes down?
What’s the point of bringing up twitter? It is strange to seek victimhood for a petulant billionaire. Of course, it is worse with GitHub because GitHub actually provides useful functionality.
> What makes you say it’s “somehow strangely tolerated” when GitHub goes down?
The folks complaining about GitHub going down are the same people who stay and are willing to tolerate the regular incidents and chaos on the site.
And it's not just that the GitHub incidents have been happening for years; it has gotten worse, to the point of an incident every month.
> Of course, it is worse with GitHub because GitHub actually provides useful functionality.
That isn’t an excuse for tolerating regular downtime for a site with over 100m+ active users, especially with it running under Microsoft stewardship who should know better.
Any other site with that many users and with a horrendous record of downtime like Github would be rightfully branded as unreliable. No excuses.
A reminder of how centralized and dependent the whole industry has become on GH, which is ironic, considering that git itself is designed to be decentralized.
Good opportunity to think about mirroring your repos somewhere else like Gitea or Gitlab.
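A push mirror covers the "GitHub is down" case with two commands (remote name and URL are examples; GitLab and Gitea can also be configured to pull-mirror from their side):

```
# mirror all branches and tags to a second forge; re-run (or cron) to stay in sync
git remote add gitlab git@gitlab.com:you/project.git
git push --mirror gitlab
```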
Github is more than a remote host for git repositories. It's become one of the major CDNs for software distribution. GitHub Pages hosts a majority of the static sites that developers use. You won't be able to use Cargo, Nix, Scoop and other package managers right now because their registries have a critical dependency hosted on GitHub.
This is not to mention all the projects that rely on Github for project management, devops, community and support desk.
GitHub is also very international; I doubt even isolated netizens like those in China are shielded from this outage. I imagine very, very few software shops are unscathed by this. The whole affair is very on brand for 21st century software, which is to say pitiful.
- Champion a hard-to-use VCS which, to its credit, is distributed
- Make everyone dependent on all the centralized features of your software to use Git [1][2]
- Now you have a de facto centralized, hard-to-use VCS with thousands of SO questions like "my code won't commit to the GitHub"
- Every time you go down, a post with hundreds of comments appears on HN
How to get bought for a ton of cash by a tech mega corporation.
[1] Of course an exaggeration. Everyone can use it in a distributed way or mirror. The problem occurs when you’re on a team and everyone else doesn’t know how to.
[2] I’m pretty sure that even the contributors to the Git project rely on the GitHub CI since they can’t run all tests locally.
We installed a private GitLab instance on our own servers exactly out of fear that Github might suddenly alter the deal or just cease operations. Pretty happy with our decision so far.
Actually both. Our internal closed source projects are only in our GitLab. The open-source stuff is both on GitHub and our GitLab. Since our GitLab instance isn't public we only use the issue tracker on GitHub for public stuff.
The key difference is being able to mirror communication channels. While you can continue to work fine with your local repo, the only way to share those changes is via another forge, or by sending patches through some other channel. Having another forge to distribute code is generally the better option.
The odds of all services doing an `rm -rf /` at the same time are pretty small, to be honest. The point is to have your work in multiple places, such that you're not reliant on a single service.
Senior: Ah found it! Let's just rollback one revision on the db.
Newguy: let me fix this! `kubectl rollout undo ... --to-revision=1`
Newguy: Ok, Started rollback to revision one!
Senior: Uh-oh..
The status page says all is well, though: https://www.githubstatus.com/. Hilarious.