Hacker News
Gitlab.com is down (status.gitlab.com)
133 points by freedomben on July 7, 2023 | 100 comments



I hope they didn't delete the production database again.


When that incident happened, GitLab published an actual chat transcript of the incident response. It was a very interesting read. It had the real names of the engineers involved for the first 24 hours or so, then they anonymized it. At some later date they seem to have removed it entirely, which is a shame, as it was an educational read.

All that I can find left online is this [1], which is still informative, but not nearly as interesting as I remember the chat transcript being.

[1] https://about.gitlab.com/blog/2017/02/10/postmortem-of-datab...



> 2017/01/31 23:00-ish YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

> 2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left

I can't even imagine the sinking feeling...


> YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.

Then in the post-mortem about lack of backups:

> LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage

> Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

I have had (and inevitably will have again) bad days like poor YP. All I can count on is maintaining good habits, like making backups before undertaking production work, as YP did.


> like making backups before undergoing production work

The specific part you mention also brings up a really vital part of a backup system: testing that the backups generated can actually be restored.

I've seen so many companies with untested recovery procedures where most of the time they just state something like "Of course the built-in backup mechanism works; if it didn't, it wouldn't be much of a backup, would it? Haha" without ever actually trying to recover from it.

Although, to be fair, of the dozens of untested setups I've seen, only once did the backups actually fail when it mattered; but the morale hit that company took really burned the lesson into my brain: test your backups.
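A minimal sketch of what such a restore test could look like for a PostgreSQL setup; the dump path and table name are hypothetical, and a real drill would also check application-level integrity rather than just row counts:

  # Hypothetical restore drill: restore the latest logical backup into a
  # throwaway database, then run a basic sanity check on the row count.
  createdb restore_test
  pg_restore --dbname=restore_test /backups/latest.dump    # path is illustrative
  rows=$(psql --dbname=restore_test --tuples-only --no-align \
         --command="SELECT count(*) FROM projects;")       # table is illustrative
  if [ "${rows:-0}" -gt 0 ]; then
    echo "restore drill passed (${rows} rows)"
  else
    echo "restore drill FAILED"
  fi
  dropdb restore_test

Scheduling something like this to run regularly (and alerting when it fails) is what turns "we have backups" into "we can restore".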


Indeed, the feeling of dread when you do something that causes prod to go down is bad enough. I can't even imagine the feeling when accidentally deleting prod data...


That's pretty cool. I notice how MS has become so much more respected as they have embraced open source. Is the next frontier for companies to become open-process, where the best companies attract the best people because of their open processes?


I’ve actually been exploring this concept myself with a blogging side project I’m working on.

Things were “mehhh” for developer interest for the first month or so, as I was working off of an internal todo list. I didn’t think anyone cared about my 300-line-long todo.txt file, but then I started to wonder if I should find a way to put that doc out in the open for developers to follow, and possibly jump in and contribute.

I had an inkling that I could use GitHub issues to help with this, but I believed the title “issues” would hurt my project. I was under the impression that a new open source project with a single contributor and a ton of open “issues” would look bad to developers.

I started to inquire with devs on IH about using issues for feature tracking. Much to my surprise, I got an overwhelming “yes, you need to use issues”. I was also told not to worry about the misleading “issues” title, and that enough developers were knowledgeable enough to know they weren’t just bug reports.

As I started to open issues and ask for help, surprisingly I started getting traffic and interest. The more issues I opened, and the more open I was online about my code and my plans for it, the more followers and contributors I’ve gotten.

My plan at this point is to just follow GitLab's model and go for full transparency with everything.

My side project mentioned above can be downloaded here: https://github.com/elegantframework/elegant-cli


i think ultimately this will not be entirely possible. if a company posts transcripts with real names of people, that can be damaging for the people (they might misbehave simply due to company culture, pressure not visible through the open process, etc.), which would give the company a cheap way out (employee scapegoat). since this is a possibility, i think employees and unions etc. would resist. that said, if it were an honest affair it would be awesome, and that (possible but improbable) future looks super cool :)


What's interesting about that is that they had a relatively small db, and that they only had a single snapshot available (and WAL apparently wasn't available or backed up for a point-in-time restore). We use RDS on AWS, which thankfully handles automated daily snapshots and point-in-time restore via the binlog. The cost is relatively cheap too (for a company making millions), and our db is several times larger than GitLab's was at the time.
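For reference, a point-in-time restore on RDS is a single call; a rough sketch with hypothetical instance identifiers and timestamp:

  # Restore a new instance from continuous backups to a specific timestamp.
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier prod-db \
    --target-db-instance-identifier prod-db-pitr-check \
    --restore-time 2023-07-07T12:00:00Z
  # Or restore to the latest restorable time instead of a fixed timestamp.
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier prod-db \
    --target-db-instance-identifier prod-db-latest \
    --use-latest-restorable-time

The restore always creates a new instance, which also makes it a convenient way to run the kind of periodic restore test discussed upthread.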

It's also pretty worrying that a single developer was doing work directly on production databases without a second person there to say "yeah, looks good". This is a big operational mistake, no matter how good you think you are.



Is this the original chat transcript? https://news.ycombinator.com/item?id=36635060


I believe this contains excerpts from the original transcript, at least. I'm not sure if this is the same document I read, but if not then it's close!


They actually broadcast a video call of them fixing it too, but they removed that as well.


I thought the parent post was a joke, and then I read your link...


I can hang with a little coffee time on a Friday and just commit locally for now.

GitHub, by comparison, has had more than 30 outages this year alone. https://www.githubstatus.com/


Here is a retelling of that incident:

"Dev Deletes Entire Production Database, Chaos Ensues" https://youtu.be/tLdRBsuvVKc


Lol


What a difference. GitLab is down -> they tell you on their status page that they have problems. Unlike other cloud providers, which, even when they are obviously down, won't change anything on their status page.


I can't say anything bad about gitlab. Incidents happen and I really like that they are transparent about it.


The status page links to https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1... for details, but that 503s because it's on gitlab.com


> Components: Website, API, Git Operations, Container Registry, GitLab Pages, CI/CD - GitLab SaaS Shared Runners, CI/CD - GitLab SaaS Private Runners, CI/CD - Windows Shared Runners (Beta), SAML SSO - GitLab SaaS, Background Processing, GitLab Customers Portal, Support Services, packages.gitlab.com, version.gitlab.com, forum.gitlab.com, docs.gitlab.com, Canary

> Locations: Google Compute Engine, Digital Ocean, Zendesk, AWS

that's some _impressive_ blast radius and I actually struggle to think what would cause that much damage


DNS, obviously.


According to the title of the issue (2023-07-07: Blackbox probes for https://cdn.artifacts.gitlab-static.net/cdn-test are failing) you win the game:

  $ host cdn.artifacts.gitlab-static.net
  Host cdn.artifacts.gitlab-static.net not found: 3(NXDOMAIN)

They must have used that same healthcheck for every single load balancer or something, though, for it to nuke AWS, DO, and GCP.


> that's some _impressive_ blast radius

Most of the entries look like batch job processing, either by gitlab or managing jobs in runners.

Perhaps authentication, which is mandatory when managing batch jobs, including on third-party nodes.

If auth goes down on any service, you'll see similar blast radius.


Well, consider the fact that "most of the entries" comprise 10% of GitLab and things like "Git Operations" comprise the remaining 90%. This is a total system failure.


Might be a domain issue or a routing misconfiguration. But people are mentioning 5xx errors so it could also be FELB/NLB misconfiguration.


Thanks for your feedback. GitLab team member here.

The incident review is being updated in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1... and I have added your feedback to https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1...


Considering Gitlab employees use Gitlab to make changes to Gitlab, how will they fix Gitlab?


At GitHub, we (eventually) ran a completely separate GitHub instance and would deploy there first. We had tooling which allowed us to deploy manually to production from that backup instance in the event we made a breaking change.


It's git, so everyone has a copy of the repo. You can continue collaborating with send-email or some random SSH server. That's the beauty of a DSCM.
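A hedged sketch of that fallback; the host, path, and mailing address are hypothetical, and any machine reachable over SSH can serve as a temporary remote:

  # Stand up a plain bare repository on any box you can SSH into.
  ssh backup.example.com 'git init --bare /srv/git/project.git'
  git remote add fallback ssh://backup.example.com/srv/git/project.git
  git push fallback main
  # Or skip remotes entirely and exchange work as mailed patches.
  git format-patch origin/main..HEAD            # one .patch file per commit
  git send-email --to=team@example.com *.patch  # needs send-email configured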


Yeah, but they also use the Pipeline features to build and deploy changes.


This. Merge to master -> deploy to staging -> tests run -> push-button enables go-to-prod.


But they are clever; you can run the runner locally.

(They've started to ruin that, but let's say the idea is still visible.)


Yeah you can't really. It doesn't support a ton of syntax and features.

And they also said they were going to remove that ability too. To be fair to them, after the inevitable response they did say they would come up with a proper solution, though that was some time ago.

Obviously the real solution is to use Bazel and not to use Gitlab CI as a crap build system.


Bazel seems like a good idea, but it's far too immature to actually work in the FOSS world, almost none of the external rules_* repos are Google quality, and it damn near requires a PhD to set up properly.

I spent a good few months learning it and it's not the tool I would reach for in almost any circumstance unfortunately.

Docs are also lacking, which is certainly not a problem with GitLab CI.


Yeah it's definitely not simple, but I have repeatedly worked for companies that get stuck on all the problems that Bazel solves. I'm like "we should use Bazel, it solves this issue properly", and they're just like "nah... so anyway this is a big problem, how do we solve it??!".

It would be nice if it had better integration with existing package managers like pip, cargo, npm and so on though. Vendoring your dependencies is fine for a big project like Chrome or Android or your company monorepo; it's a bit annoying for a small CLI tool or whatever.

Another option is something like Landlock Make, but I haven't tried it.


Yeah, why not? But often GNU Make is fine for incremental builds.


If by local you mean local to a cloud instance that runs automated builds on every merge for specific branches, then yes.

Granted, those build instances pull from the central git repo and are triggered from GitLab... so also, no.


Local, as in in-tree.


Only for `script: "echo hello world"` style jobs; anything more real makes `gitlab-runner exec` a total farce.
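For context, this is roughly what running a single job locally looks like; the job name is hypothetical, and `exec` only understands a subset of the CI YAML (no artifacts between stages, `needs`, or most predefined variables):

  # Run the job named "test" from the local .gitlab-ci.yml in a Docker container.
  gitlab-runner exec docker test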


GitLab team member here.

If you plan on a GitLab mirror, including the Git repository and groups/projects, the direct transfer migration can be helpful https://about.gitlab.com/blog/2023/01/18/try-out-new-way-to-...

GitLab Geo can be an alternative for larger, high-availability setup requirements. https://docs.gitlab.com/ee/administration/reference_architec...


You could, but in all honesty, most programmers (myself included) aren't as brilliant as Linus Torvalds. While I could stand up a git mirror in my sleep (because it's trivial - you should be able to as well), the finer points of coordinating work via a git mirror, when https://xkcd.com/1597/ is the level of understanding most people have of git, suggest it's not going to work. Add in the fact that the git repository is the least of the functionality (e.g. PRs and pipelines), and being down is a serious barrier to progress.

https://xkcd.com/303/ but for "GitHub is down".


Yeah, I didn't think of testing and deployment pipelines. Where I work we use GitHub actions for testing, but not for deployment. And our projects are small enough that I can easily run the tests in docker on my development PC.


Agree - also, sending a patch via email is definitely possible, but not at all practical. I’m not sure I’ve ever come across anyone whose git workflows include email.


You've never come across the Linux kernel? The Linux kernel's process isn't practical? Okay, that's an interesting point of view.


you don't need to send patches via email, you can push them over ssh pretty easily

For example, on my team we have a few devs who push patches from their laptops directly to their work desktops as part of day-to-day synchronization. No central ssh server involved.
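A rough sketch of that setup, with a hypothetical host and path (assuming the repo is already cloned on the desktop); pushing to a branch that isn't checked out on the other machine avoids any server-side configuration:

  # On the laptop: add the desktop as a plain SSH remote.
  git remote add desktop me@desktop.local:src/project
  # Push work-in-progress to a dedicated branch on the desktop.
  git push desktop HEAD:refs/heads/laptop-wip
  # Later, on the desktop: pick the work up.
  git checkout laptop-wip   # or merge it into your own branch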


I think it gets deployed from https://ops.gitlab.net/


GitLab team member here. Thanks for asking.

Incidents can have different types, i.e. when an application bug or performance regression is discovered, this can involve reverting MRs and rolling back releases. The Platform, Delivery group has a top-level responsibility for ensuring continuous delivery of the GitLab application software to GitLab SaaS, https://about.gitlab.com/handbook/engineering/infrastructure...

Other incidents may involve hardware or infrastructure failures, or a combination of both: an infrastructure failure that renders GitLab application services unavailable. This requires cross-functional collaboration from infrastructure, product, engineering, etc. teams during the incident.

To get a better understanding here, it is helpful to review the incident management handbook https://about.gitlab.com/handbook/engineering/infrastructure...

Additional helpful information:

- The GitLab.com SaaS production architecture is documented in https://about.gitlab.com/handbook/engineering/infrastructure...

- The Monitoring of GitLab.com handbook provides insights into monitoring workflows, incident management, SLAs, etc. https://about.gitlab.com/handbook/engineering/monitoring/

- Runbooks https://about.gitlab.com/handbook/engineering/infrastructure...

For the current incident discussed in this HN thread, the review issue can be followed in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1... to learn more.


Their status page is not very helpful because they just link to a tracker that is also returning a 503. Does anyone know what's going on under the hood?


GitLab team member here. Apologies for the problem. The incident review issue continues in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1...



Jesus, this is worse than AWS... I had to scroll a while to see the one service that was broken.


That is just some extra logging; it has been happening to one of my clusters for weeks.


Looks like their infrastructure team is hiring site reliability engineers. https://about.gitlab.com/jobs/all-jobs/


Good thing I'm off for the weekend.


The status page is hosted by status.io behind cloudfront.net (amazon.com).


I'm seeing a mix of 503s, 522s, 500s, and once in a while a successful page load on the website. No word on the root cause at this point.


In case folks found the relevant submission via https://news.ycombinator.com/from?site=status.gitlab.com, this is the URL-less submission: https://news.ycombinator.com/item?id=36634089


This also affects Gitlab Pages. If your site is hosted there, your visitors are getting 503s.


My GitLab Pages site is working, but I'm getting 503s when attempting to push to the repo and when browsing the GitLab site.


My site seems unaffected.


It may be affecting authorization, then. I have a site that's private and requires auth and I can't reach it.


If only there was a way to distribute version control and not rely on a central entity


I get that this is sarcasm, but Gitlab is a lot more than just version control. CI/CD, project management, even hosting (Gitlab pages).

But even for just version control for most use cases you need some common remote for people to push and pull to. Even a bare git setup has the potential of going down and putting people in the exact same boat that they are in now (they can commit locally, but can't push).


I know it is, which imo is the problem. You shouldn’t have to rely on an additional third-party for your company to be able to commit, checkout, build, and deploy code.

All these startups who experience a 100% dev outage when github or gitlab go down are fundamentally broken.


> All these startups who experience a 100% dev outage when github or gitlab go down are fundamentally broken.

On this I totally agree. Although if you have deploys fully automated and gated behind CI/CD checks (which in many cases is required for compliance), how do you structure things so that a total outage of the CI/CD platform doesn't knock out the dev cycle?


A gitlab outage is like a snow day for devs. Welp can’t do anything, see you guys back tomorrow.


There is the git send-email & co. I understand the extreme convenience of things like github and gitlab but as everything moves onto them it does seem we are becoming really dependent on a few sites.


Have you tried to run your own mail infrastructure in 2023? It's a nightmare, from the junk-mail-handling schemes of every individual email provider to just dealing with antiquated software for mailing lists and mailer daemons.


I'm not sure why you think you would need to "run your own mail infrastructure" to use git send-email, but you don't.


And yet you can host your own Gitlab instance, so...


GitLab appears to be back online (for me anyway). Status page still shows red tho.


CI jobs are still failing ...

  Fetching changes...
  Initialized empty Git repository in /builds/jjg/cptutils/.git/
  Created fresh repository.
  fatal: unable to access 'https://git-us-east1-d.ci-gateway.int.gprd.gitlab.net:8989/jjg/cptutils.git/':
  Failed to connect to git-us-east1-d.ci-gateway.int.gprd.gitlab.net
  port 8989 after 132297 ms: Couldn't connect to server


Yep the web UI is back up and the remote is accepting pushes/pulls now. I'm guessing the major bleeding has been stopped and they have a massive cleanup job to do over the next few days or weeks.


I'm still getting...

    remote: ERROR: The git server, Gitaly, is not available at this time.

What a drag... I guess I'll just commit locally... bah! Friday anyway!


Approaching wine o'clock in my time zone ...


I know! I can feel it coming as well... just riding that last hour at the office.


Welp, time to migrate everything to GitHub


You have to wait till GitHub is up


That’s why I store all my code on Twitter.


I thought we should use the Blockchain.


"Only wimps use tape backup. REAL men just upload their important stuff on ftp and let the rest of the world mirror it." Linus Torvalds


Technically a git repository is a blockchain.


Nah I print all my code.


You mean Threads*


Lucky for you, I have a shitty, not-well-tested tool to do that.

https://github.com/tylerjgarland/git2git

lol. I used it to migrate my gitlab to github on the last outage.


Yes, that'll be so much better...


At least their status page works.


Well, those incidents are becoming rather frequent.


Just another reason to self host...


Yeah, I self-host a couple of gitlab-ce instances in Docker (one personal, one for work) and can recommend doing so. However, that's just changing where the points of failure are and who gets to do the work when there are failures.
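For anyone curious, a sketch of what that looks like; the hostname, ports, and volume paths are illustrative, and a production setup still needs its own backup, TLS, and upgrade story:

  # Single-container GitLab CE with persistent config, logs, and data.
  docker run --detach \
    --hostname gitlab.example.com \
    --publish 443:443 --publish 80:80 --publish 2222:22 \
    --name gitlab \
    --volume /srv/gitlab/config:/etc/gitlab \
    --volume /srv/gitlab/logs:/var/log/gitlab \
    --volume /srv/gitlab/data:/var/opt/gitlab \
    gitlab/gitlab-ce:latest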


> However, that's just changing where the points of failure are and who gets to do the work when there's failures.

...and more importantly who gets to work so that there isn't a failure.


Just a normal day with microservices.


GitLab isn't really a microservice architecture.


GitLab team member here.

Next to all-in-one packages provided with the Omnibus package on Linux https://docs.gitlab.com/omnibus/ and in containers https://docs.gitlab.com/ee/install/docker.html you can also install GitLab into Kubernetes clusters using the Operator https://docs.gitlab.com/operator/ or Helm Chart. https://docs.gitlab.com/charts/

Most of GitLab.com SaaS is deployed on Kubernetes using the GitLab cloud-native Helm chart (with a few exceptions). More details in https://about.gitlab.com/handbook/engineering/infrastructure...
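For the Helm route, a hedged sketch; the domain and email are placeholders, and a real install involves more decisions (ingress, TLS, external object storage) covered in the chart docs:

  # Add the official chart repo and install a release named "gitlab".
  helm repo add gitlab https://charts.gitlab.io/
  helm repo update
  helm install gitlab gitlab/gitlab \
    --set global.hosts.domain=example.com \
    --set certmanager-issuer.email=admin@example.com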


Monoliths crashing and taking everything down with them isn't news, right?


Not quite; it's 100x harder to find where the cause is with microservices.


IME it's one microservice doing something unexpected -- then down(|up)-stream starts to act unexpectedly -- and it cascades.

It's way more fun to debug than some single-monster code base with a complete stack-trace from the single-process. That basically tells you exactly where to look and how it got there! What fun is that, where's the adventure!?




