When that incident happened, GitLab published an actual chat transcript of the incident response. It was a very interesting read. It had the real names of the engineers involved for the first 24 hours or so, then they anonymized it. At some later date they seem to have removed it entirely, which is a shame, as it was an educational read.
All that I can find left online is this [1], which is still informative, but not nearly as interesting as I remember the chat transcript being.
>2017/01/31 23:00-ish
> YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
>2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left
> YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.
Then in the post-mortem about lack of backups:
> LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
> Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
I have had (and inevitably will have again) bad days like poor YP's. All I can count on is maintaining good habits, like making backups before undertaking production work, as YP did.
> like making backups before undertaking production work
The specific part you mention also brings up a really vital part of a backup system: testing that the backups you generate can actually be restored (see the sketch at the end of this comment).
I've seen so many companies with untested recovery procedures where most of the time they just state something like "Of course the built-in backup mechanism works; if it didn't, it wouldn't be much of a backup, would it? Haha" while never actually having tried to recover from one.
Although, to be fair, out of the tens of untested setups I've seen, only once did it have an actual impact and the backups actually didn't work, but the morale hit the company took really burned the lesson into my brain: test your backups.
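Coming back to the restore-testing point, here is a minimal sketch of what that can look like in practice, assuming a Postgres database; the database and table names are made up:

    # take a logical backup in custom format so pg_restore can be used
    pg_dump --format=custom --file=app.dump appdb

    # restore into a throwaway database, never the real one
    createdb appdb_restore_test
    pg_restore --dbname=appdb_restore_test app.dump

    # sanity-check that the restored data is actually there, then clean up
    psql appdb_restore_test -c "SELECT count(*) FROM users;"
    dropdb appdb_restore_test

Automating something like that on a schedule (and alerting when it fails) is the difference between having backups and hoping you have backups.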
Indeed, the feeling of dread when you do something that causes prod to go down is bad enough. I can't even imagine the feeling when accidentally deleting prod data...
That's pretty cool.
I notice how MS has become so much more respected as they have embraced open source. Is the next frontier for companies to become open-process, where the best companies attract the best people because of their open processes?
I’ve actually been exploring this concept myself with a blogging side project I’m working on.
Things were “mehhh” for developer interest for the first month or so, as I was working off of an internal todo list. I didn’t think anyone cared about my 300-line-long todo.txt file, but then I started to wonder if I should find a way to put that doc out in the open for developers to follow, and possibly jump in and contribute.
I had an inkling that I could use GitHub issues to help with this, but I believed the title “issues” would hurt my project. I was under the impression that a new open source project with a single contributor and a ton of open “issues” would look bad to developers.
I started asking devs on IH about using issues for feature tracking. Much to my surprise, I got an overwhelming “yes, you need to use issues”. I was also told not to worry about the misleading “issues” title, and that developers were knowledgeable enough to know they weren’t just bug reports.
As I started to open issues and ask for help, surprisingly I started getting traffic and interest. The more issues I opened, and the more open I was online about my code and my plans for it, the more followers and contributors I’ve gotten.
My plan at this point is to just follow GitLab’s model and go fully transparent with everything.
I think ultimately this will not be entirely possible. If a company posts transcripts with real names, that can be damaging for the people involved (they might misbehave simply due to company culture, pressure not visible through the open process, etc.), which would then give the company a cheap way out (an employee scapegoat). Since this is a possibility, I think employees, unions, etc. would resist. That said, if it were an honest affair it would be awesome, and that (possible but improbable) future looks super cool :)
What's interesting about that is that they had a relatively small db, and that they only had a single snapshot available (and WAL apparently wasn't available or backed up for a point-in-time restore). We use RDS at AWS, which thankfully handles automated daily snapshots and point-in-time restore via binlog. Cost is relatively cheap too (for a company making millions), and our db is several times larger than GitLab's was at the time.
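For anyone who hasn't used it, a point-in-time restore on RDS is roughly a one-liner; it restores into a new instance rather than in place, and the identifiers and timestamp below are just examples:

    # restore a new instance from the source's backups/logs at a given time
    aws rds restore-db-instance-to-point-in-time \
        --source-db-instance-identifier prod-db \
        --target-db-instance-identifier prod-db-restored \
        --restore-time 2017-01-31T22:00:00Z
    # or pass --use-latest-restorable-time instead of --restore-time

You still have to repoint the application (or swap instance names) afterwards, but the heavy lifting is handled for you.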
It's also pretty worrying that a single developer was doing work directly on production databases without a second person there to say "yeah, looks good". This is a big operational mistake, no matter how good you think you are.
What a difference. When GitLab is down, they actually say on their status page that they have problems. Unlike other cloud providers, which, even when they are obviously down, won't change anything on their status page.
Well, consider the fact that "most of the entries" comprise 10% of GitLab and things like "Git Operations" comprise the remaining 90%. This is a total system failure.
At GitHub, we (eventually) ran a completely separate GitHub instance and would deploy there first. We had tooling which allowed us to deploy manually to production from that backup instance in the event we made a breaking change.
Yeah you can't really. It doesn't support a ton of syntax and features.
And they also said they were going to remove that ability too. Tbf to them, after the inevitable response they did say they would come up with a proper solution, though that was some time ago.
Obviously the real solution is to use Bazel and not to use Gitlab CI as a crap build system.
Bazel seems like a good idea, but it's far too immature to actually work in the FOSS world, almost none of the external rulesets are Google quality, and it damn near requires a PhD to set up properly.
I spent a good few months learning it and it's not the tool I would reach for in almost any circumstance unfortunately.
Docs are also lacking, which is certainly not a problem with GitLab CI
Yeah it's definitely not simple, but I have repeatedly worked for companies that get stuck on all the problems that Bazel solves. I'm like "we should use Bazel, it solves this issue properly", and they're just like "nah... so anyway this is a big problem, how do we solve it??!".
It would be nice if it had better integration with existing package managers like pip, cargo, npm and so on though. Vendoring your dependencies is fine for a big project like Chrome or Android or your company monorepo; it's a bit annoying for a small CLI tool or whatever.
Another option is something like Landlock Make, but I haven't tried it.
You could, but in all honesty, most programmers (myself included) aren't as brilliant as Linus Torvalds, and, while I could stand up a git mirror in my sleep (because it's trivial - you should be able to as well), the finer points of coordinating work via a git mirror, when https://xkcd.com/1597/ is the level of understanding most people have of git, seem to say it's not going to work. Add on the fact that the git repository is the smallest bit of the functionality (compared to e.g. PRs and pipelines), and it being down is a serious barrier to progress.
Yeah, I didn't think of testing and deployment pipelines. Where I work we use GitHub actions for testing, but not for deployment. And our projects are small enough that I can easily run the tests in docker on my development PC.
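For a small project that really is just a couple of commands, something like the following (the image name and test runner are made up, assuming the Dockerfile installs the test dependencies):

    # build an image with the app and its test deps, then run the suite in it
    docker build -t myapp-test .
    docker run --rm myapp-test pytest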
Agree - also sending a patch via email is definitely possible, but not at all practical. I’m not sure I’ve ever come across anyone whose gitops workflows include email.
you don't need to send patches via email, you can push them over ssh pretty easily
For example on my team we have a few devs who push patches from laptop directly to their work desktop, as a part of day-to-day synchronization. No central ssh server involved.
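Concretely, that kind of peer-to-peer setup is just an ssh remote pointing at another machine you can log into; the hostname, user and path below are made up:

    # on the desktop (once): allow pushes into the currently checked-out branch
    git config receive.denyCurrentBranch updateInstead

    # on the laptop: add the desktop as a remote and push work to it
    git remote add desktop ssh://me@work-desktop/home/me/src/project
    git push desktop my-feature-branch

No hosting platform involved at all - just git and sshd.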
Incidents can be of different types. For example, when an application bug or performance regression is discovered, this can involve reverting MRs and rolling back releases. The Platform, Delivery group has top-level responsibility for ensuring continuous delivery of the GitLab application software to GitLab SaaS: https://about.gitlab.com/handbook/engineering/infrastructure...
Other incidents may involve hardware or infrastructure failures, or a combination of both - e.g. an infrastructure failure that renders GitLab application services unavailable. These require cross-functional collaboration from infrastructure, product, engineering, and other teams during the incident.
Their status page is not very helpful because they just link to a tracker that is also returning a 503. Does anyone know what's going on under the hood?
I get that this is sarcasm, but GitLab is a lot more than just version control: CI/CD, project management, even hosting (GitLab Pages).
But even for just version control, for most use cases you need some common remote for people to push to and pull from. Even a bare git setup has the potential of going down and putting people in the exact same boat they are in now (they can commit locally, but can't push).
I know it is, which imo is the problem. You shouldn’t have to rely on an additional third-party for your company to be able to commit, checkout, build, and deploy code.
All these startups who experience a 100% dev outage when github or gitlab go down are fundamentally broken.
> All these startups who experience a 100% dev outage when github or gitlab go down are fundamentally broken.
On this I totally agree. Although if you have deploys fully automated and gated behind CI/CD checks (which in many cases is required for compliance), how do you structure things such that a total outage of the CI/CD platform doesn't knock out the dev cycle?
There is git send-email & co. I understand the extreme convenience of things like GitHub and GitLab, but as everything moves onto them, it does seem we are becoming really dependent on a few sites.
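For reference, the email workflow itself is only a few commands (the address below is made up); the painful part is everything around email delivery:

    # turn the last commit into a patch file and mail it to the maintainer/list
    git format-patch -1 HEAD
    git send-email --to=maintainer@example.org 0001-*.patch

    # on the receiving side, apply the patch straight from a mailbox file
    git am patch.mbox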
Have you tried to run your own mail infrastructure in 2023? It's a nightmare, from all the junk-mail handling schemes of every individual email provider to just dealing with antiquated software for mailing lists and mailer daemons.
Fetching changes...
Initialized empty Git repository in /builds/jjg/cptutils/.git/
Created fresh repository.
fatal: unable to access 'https://git-us-east1-d.ci-gateway.int.gprd.gitlab.net:8989/jjg/cptutils.git/':
Failed to connect to git-us-east1-d.ci-gateway.int.gprd.gitlab.net
port 8989 after 132297 ms: Couldn't connect to server
Yep the web UI is back up and the remote is accepting pushes/pulls now. I'm guessing the major bleeding has been stopped and they have a massive cleanup job to do over the next few days or weeks.
Yeah, I self-host a couple of gitlab-ce instances in Docker (one personal, one for work) and can recommend doing so. However, that's just changing where the points of failure are and who gets to do the work when there are failures.
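For anyone tempted to try it, the basic Docker setup is roughly the following; the hostname, ports and volume paths are just examples, so check the gitlab-ce image docs for the currently recommended flags:

    docker run --detach \
      --hostname gitlab.example.com \
      --publish 443:443 --publish 80:80 --publish 2222:22 \
      --volume /srv/gitlab/config:/etc/gitlab \
      --volume /srv/gitlab/logs:/var/log/gitlab \
      --volume /srv/gitlab/data:/var/opt/gitlab \
      --name gitlab \
      gitlab/gitlab-ce:latest

The ongoing cost is then upgrades, monitoring, and (fittingly for this thread) taking and actually testing your own backups.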
IME it's one microservice doing something unexpected -- then down(|up)-stream starts to act unexpectedly -- and it cascades.
It's way more fun to debug than some single monster code base with a complete stack trace from the single process. That basically tells you exactly where to look and how it got there! What fun is that? Where's the adventure!?