GitHub Incident (githubstatus.com)
213 points by wgjordan on March 22, 2022 | 165 comments



This has been happening pretty much every day for a long time. Maybe this will push more people towards some decentralization. A self-hosted Gitea instance my organization has been using had exactly zero downtime in the past year. None. Same with our CI system. Yes, I know, GitHub is 10⁹⁹⁹ times larger than our puny Gitea instance, and that's why they're having issues, but why should I care when one is working, and the other one isn't?


>> A self-hosted Gitea instance my organization has been using had exactly zero downtime in the past year. None. Same with our CI system.

And when they do go down, how long will they be unavailable? When this happens (and it will happen) please post here so we can say "why are you running infrastructure yourself?"


I'm honestly sick of this conversation.

Have we as an industry forgotten how to make systems? That's pretty telling if so.

Why are people so utterly afraid of running their own systems?

If a SaaS platform goes down it's followed by "#hugops; running large systems is hard!", when someone says "Well, if it's hard, why not break it into smaller pieces and run it myself" it's met directly with "Do you think you can do better!?!".

Honestly, probably, maybe not. Why does it matter?

Those two things cannot be true simultaneously. You cannot say "running big systems is hard" and "they can run a big system better than you can run a small system".

I'm not going to fall into the trap of arguing points like the fact that if you own the data you can have your own D/R and backup strategies, and that you can run maintenance on your own schedule (hell: you can run maintenance to migrate things around, which is more than GitHub can do; that's not an indictment of them, just the nature of being a hosted service).

Honestly, I miss sysadmins. People who were not afraid of servers, even if they said "no" sometimes. Because, you know, being afraid of servers and hosting things is just pathetic.


From an organisational perspective, outsourcing infrastructure is the safest and most cost-effective way to meet Regulatory & Compliance requirements.

When you run your own servers, not only do you need to maintain them, you are also responsible for their certification and getting them audited.

As a manager in a medium sized company, I think it is essential to identify what "needs building" and what could be bought to ensure the needs of stakeholders are met in a timely manner.


Heh, time was "compliance" was the reason to host your own services. It is a lot easier to be compliant with many things if you're hosting your own services.

Back when I was working in eCommerce (PCI-DSS Tier 1, Cardholder data on premises) it was basically impossible to use hosted services.

As it was then, so it is now: "Compliance" is just nebulous enough of a word to prevent critical thinking.


I wouldn't call compliance a deterrent to critical thinking. The end goal for any business is to make a profit; if an option exists that complies with industry standards and also happens to be the least expensive in terms of resourcing, it would be foolish not to take it.


I think you misunderstand the intention. The parent comment is comparing uptime to their own system and declaring superiority, and the one you replied to is asking about the opposite: the duration of downtime.

Sure you can build your own infrastructure but don’t pretend that’s superior. Each has its pros and cons.

The last line is probably making fun of their reaction: anyone as dogmatic as the parent commenter would respond to someone else's self-hosted infrastructure going down with exactly that comment.


> Those two things cannot be true simultaneously. You cannot say "running big systems is hard" and "they can run a big system better than you can run a small system".

They can absolutely be true simultaneously. In fact, I can't name a logic system where those two statements comprise a contradiction. They are logically completely unrelated to each other.


I ran a GitLab instance for a company pretty much solo and it never wound up going down outside of when I was upgrading it, that I can recall. The thing about that is, it’s a small scale instance. If an upgrade or change failed, I could literally delete everything and restore from backup in under an hour, and never really needed to do that (although to be clear, I did actually test that it was possible.)

Now I’ll tell you the real problem with it: it worked great for people who were located in the U.S., but it was terrible for certain remote employees due to increased latency, and the fix for that would be to set up a complicated replication setup that puts data closer to people not near the main instance.

There’s obviously other issues, but that is one that is probably commonly overlooked that most SaaS solutions can do better on.


What gitlab features do you utilize where latency actually is a problem?

I currently manage an internal gitlab instance with some remote coworkers, but I haven't heard any complaints about latency. Although tbf, I feel like we've barely scratched the surface of what gitlab supports


The most impacted user had ~500ms latency to the server, which is enough that even pageloads are bound to feel slow (especially because Gitlab itself is not necessarily the fastest thing in the world sometimes.) And if I am remembering correctly, I think that was basically it: the UI pageloads just had enough chained round trips that they were way too slow. In retrospect, there may have been some ways to work around this that we didn’t end up exploring.

When I moved to a new company, the GitLab instance was still up, and last I was aware, many years down the road, it is still in use. Feature-wise, we were very much satisfied, and the GitLab CI model with Docker executors worked very well for us.


If you have high latency to a GitLab server consider setting up a GitLab Geo server locally https://docs.gitlab.com/ee/administration/geo/ to improve things (this is a paid feature).


Looks like you got downvoted for some reason, but yes, this is exactly what I am referring to. (We did pay for GitLab EE, so I think we had access to Geo.)


I once worked at a place that self hosted a lot and they had fail-over, backups and restores dialed in

If something like Git went down (which it never did), it was a few clicks away from coming back

Working with competent people is a blast


You also need an organization that will spend the money to make that happen. Many of us would like to have such systems where we work, and are capable enough to implement them, but the Finance people say 'NO'.


With something as simple as serving git for your own organisation, when you do get downtime it's going to be a case of restoring from backup and changing DNS. No need for complicated infrastructure, so bringing it back really is that simple.

Worst case is losing commits between going down and backups, but even then it'll be minimal if you set your backup to simply push to another origin.

I pay for someone else to worry about most services, but git doesn't have to be one of them.


The worst case is discovering the hard way that your backup system isn't working and you haven't been diligent about spending the time to periodically test restores.


That's why backup testing should be at least partially automated. Though there's always a point where the automation, plus the automated tests to make sure the automation itself is working, and so on, become enough of a burden that making a human responsible for sanity checks is the most resource-effective option.
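
As a sketch of what that partial automation could look like (assuming a Postgres-backed Gitea with nightly SQL dumps; all paths, database names, and the sanity query are illustrative, not from any specific setup):

    #!/bin/sh
    # Illustrative restore test; adapt the paths and the check to your own install.
    set -eu

    LATEST_DUMP=$(ls -t /backups/gitea-*.sql.gz | head -n 1)

    # Restore the newest dump into a throwaway database.
    createdb gitea_restore_test
    gunzip -c "$LATEST_DUMP" | psql -q -d gitea_restore_test

    # Sanity check: a restored instance should contain at least one repository row.
    REPO_COUNT=$(psql -At -d gitea_restore_test -c 'SELECT count(*) FROM repository;')
    dropdb gitea_restore_test

    [ "$REPO_COUNT" -gt 0 ] || { echo "backup restore test FAILED" >&2; exit 1; }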


Pushing to multiple origins is not a hard thing.
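For anyone unfamiliar, one low-effort way is extra push URLs on the existing remote (the URLs here are placeholders):

    # Keep pushing to the primary, and add a second push URL for the mirror.
    git remote set-url --add --push origin git@github.com:example/project.git
    git remote set-url --add --push origin git@gitea.internal.example:example/project.git

    # A plain `git push` now updates both remotes in one go.
    git push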


The risk is less about losing commits, since those tend to live on every devs machines too. It's about losing all of the other data like the issue tracker, comments, etc.


If you think "serving git" is the only role github is serving for organizations you are probably not the target audience.


I don't. Which is why I was explicit in the problem I'm targeting.


Very likely to be lower than GitHub/etc, because it's a much simpler system that can be reinstalled from scratch in 30 minutes by an experienced admin on brand new hardware if needed.


Less than you've already wasted waiting for GitHub to come up from the dead in the last year alone. If you're insinuating that we'll have to restore everything from scratch, don't worry, the machine is being continuously replicated and sysadmins are trained to restore everything chop-chop.


5 years of a Gogs instance. Zero downtime. I'll come back to you when it goes down.


does gogs plan to support any kind of CI integration like gh actions?


Gogs and gitea are designed to be a single-binary minimal source code hosting service. Additional services like CI within that binary are likely to bloat it up and make it less reliable. However, it's easy to integrate a CI service with it - like Woodpecker CI [1]. Woodpecker also has a minimalist approach like gogs. If you instead prefer a full package, then gitlab is the best solution.

[1] https://woodpecker-ci.org/docs/administration/vcs/gogs


No it doesn't. I fully confess the feature set is extremely minimal. But the point here is that running your own infrastructure can make a lot of sense if you know what your precise requirements are (or are not).


Depends on your setup. For a simple git server on AWS I might have it on a single-node ASG so that when something terminates the instance, the ASG notices and launches a new one. The initialization scripts would then attach and mount a pre-existing EBS volume and start services.

Very simple to set up, and as long as your data volume is not lost (which can happen, but is very rare), such a system would recover from a host failure by itself faster than most people could react to it going down. In the unlikely case that you lose the EBS volume, you'd just recover from snapshots manually.
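
A rough sketch of what the instance's init script could do (the volume ID, region, device, and service name are placeholders; note the ASG has to be pinned to the volume's availability zone, since EBS volumes are AZ-bound):

    #!/bin/bash
    # Illustrative user-data for a single-node "self-healing" git host.
    set -eu

    REGION=us-east-1
    VOLUME_ID=vol-0123456789abcdef0
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    # Attach the pre-existing data volume to the freshly launched instance.
    aws ec2 attach-volume --region "$REGION" \
        --volume-id "$VOLUME_ID" --instance-id "$INSTANCE_ID" --device /dev/xvdf
    aws ec2 wait volume-in-use --region "$REGION" --volume-ids "$VOLUME_ID"

    # Mount the repositories and bring the service back up.
    mkdir -p /srv/git
    mount /dev/xvdf /srv/git
    systemctl start git-daemon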


It only takes one month-long outage of an international undersea cable to negate all the benefit of SaaS when that SaaS is hosted in, or heavily depends on other services hosted in, the USA or Europe.

DNS goes down for a particular dot com? BGP hijinks? Backhoe through the major fibre serving your workplace?

All of these are going to happen and will lead to downtime out of your control or the control of the SaaS provider.

Thankfully git is designed to be decentralised and developers can continue to work even if they have to use paper books as reference material and not Stack Overflow.


I've been running gitlab for ~5y in companies with ~200 devs and haven't had a single downtime, it also takes very little work/effort.


Infra I have been running myself only goes down when I reboot INTENTIONALLY.


It's not about being larger, it's about doing changes. GitHub used to be pretty stable before, when they didn't really do any (user facing) changes. It stood still in terms of features, but was also pretty stable overall.

Compared to GitHub, how often does your Gitea instance go through code changes? How many major features have been switched on in that instance during the past year without any hiccups?


I tried to look at the Gitea repo to see how fast they get commits, but github is 500ing :-(

https://github.com/go-gitea/gitea


Rest assured, an open source project with ~100 contributors (of whom maybe 10% contribute regularly) has less activity and fewer changes than a multinational company focused on a SaaS with "1000 to 5000" employees worldwide.


For perspective, GitHub has 4,000 employees who list them as their employer on LinkedIn and GitLab has 1,700.


GitHub's feature churn is irrelevant to the comparison unless those features are used in parent poster's workflow.


Really? No, when comparing the uptime of two services (actually, one service and one open source self-hosted service), it's relevant to consider _why_ one would have more downtime than the other, even if they don't use those features. Even if the parent poster's workflow doesn't use the new/deprecated features, they will still be affected by the downtime GitHub has monthly.


I don't think anyone is overlooking the reasons why one might have more downtime than the other.

The problem is that ultimately the downtime itself matters and not the reason, and if you don't need any of the features that GitHub offers, then the self-hosted route is a better option.


> if you don't need any of the features that GitHub offers

In particular, all of the new features whose addition causes instability.


I think it's more the development process (as in moving things around and potentially introducing bugs) than the new features per se.


With all due respect to the developers, it would not speak well for GitHub if they can't maintain stability while refactoring. There's a lot of testing processes and CI processes explicitly around making those actions safe.

New features breaking is a lot more understandable - even expected - than regressions and refactoring failures.


I run Gitea myself as well. I generally throw in a new release once a month or so, never had any real hiccups.


Again, how much does it actually change across those releases? GitHub is most likely going through more changes on a weekly basis (if not daily) than Gitea does on a monthly basis.


The major releases generally change quite a lot in one go, for example:

https://blog.gitea.io/2022/02/gitea-1.16.0-and-1.16.1-releas...

https://blog.gitea.io/2021/08/gitea-1.15.0-is-released/

But I'm sure Github changes a lot more. However, the question is if those changes are worth the instability. For my workflow I just want my repos to be accessible, have my source, issues, and wiki accessible, and of course my basic git operations working. If that core functionality is actually not working at times, I'd definitely consider changing the git provider.


It sounds like you don't make code changes unless they are required for function.


Also for security reasons, and because some features are nice to have. But you're right in that I'm definitely not updating at random. When I do update, I basically just back up the database, throw in a new Gitea binary, and restart the service. As the releases have been fairly well tested in general, this hasn't given me issues so far.


An infrastructure company focusing on feature churn just sounds wrong.


What? Github is a social networking company.


I'm a bit dumb about these things on text, but did you drop an "/s" somewhere? I understand it has sort of a social networking layer on top, but that is not its main function right? This would be like calling a calculator app a social networking app because it allows me to share my results :D.


GitHub has always branded itself as a "social collaboration platform" rather than anything else (well, first "git hosting" but secondly the social stuff).

Here is an early version of their landing page from 2008 (the year they launched): https://web.archive.org/web/20081111061111/http://github.com...

Notice the logo says "Social code hosting" and the messaging of the page is mostly around popular repositories, collaborative features and other social elements.


Gitea is nice, what CI/CD system do you use with it? I used it in the past with Drone, and Drone left a lot to be desired (coming from/going back to GitHub actions, Drone's job syntax is a bit more complicated and focused on containers).


Drone CI has been working fine for our purposes. I agree that it may not be the most feature-packed thing in the world (for example, caching support is missing completely), but there are lots of plugins and it's easy to write your own.

We've been using exec and ssh runners when containers are not enough (for example, when you need to juggle a few VMs or build some complicated Docker image, and using docker-in-docker is not fun).

https://docs.drone.io/pipeline/exec/overview/

https://docs.drone.io/pipeline/ssh/overview/

GitHub Actions are nice… when they're actually working.


I've used Drone CI in the past with Gitea. I still prefer gitlabs CI tho


> A self-hosted Gitea instance my organization has been using had exactly zero downtime in the past year.

> Yes, I know, GitHub is 10⁹⁹⁹ times larger than our puny Gitea instance, and that's why they're having issues

Exactly. That is also my overall point. [0] It makes no sense to go 'all in' on GitHub and something goes down and everyone is stuck once again.

[0] https://news.ycombinator.com/item?id=30711442


Opposite anecdata my GitLab install absolutely self destructed itself at one point and I had to rebuild it myself, waste of a weekend


> my GitLab install absolutely self destructed itself

That’s frightening, we’ve been using it for multiple years now. It’s running fine and the only short downtime is every few weeks for the update.

What happened to your instance?


To be fair to GitLab, it was the underlying server that broke initially. The issue from there was that I couldn't restore from the backup, as the newly installed server was also running a newer version of GitLab.

Tried restoring it to the older version to run through the upgrade path and the process didn't want to work. Wish I could be more specific here but I gave up on it and decided it'd just be quicker to whip up a script to recreate all the groups/projects/CICD in a new install and push repos from backups

Don't let your GitLab server get outdated, and don't consider a backup valid unless it was taken with the latest version, to avoid the same fate.

Also it wasn't relevant here but make sure you're backing up the etc files too, while that wasn't an issue for me it could easily trip you over in a worst case scenario ( https://docs.gitlab.com/charts/backup-restore/backup.html#ba... )

Not a complaint, it's just one of the hats that needs wearing sometimes, would have been easier on me if someone else was handling it though :P


Thank you. From what I know direct updates to the next major releases of gitlab are only supported from relatively recent releases. I guess that’s what caught you.

We’ve been updating about as soon as an update is released for a while now, with very good results. (I’d rather have to restore a backup or a snapshot than to be running without the latest security fixes. If gitlab is down for an hour (didn’t happen in ~3 years) at least all developers have their local repo.)


Mind sharing some details on your install? Which database are you using (especially wondering about sqlite with gitea)? Are you using the docker install, or from package, or in k8s?

No pressure, but I'm seriously considering standing one up but it would need to be production ready.


It's a really boring setup, I'm unlikely to share anything of much value here.

No k8s or anything like that. An Ubuntu LTS virtual machine on top of I don't know what hypervisor (probably Hyper-V). Gitea requires very few resources.

Data is stored in PostgreSQL 11. Probably should be upgraded at some point, but it's working fine for now.

A Drone CI instance runs on that same machine. It controls CI workers on a dozen other VMs.

Everything is in Docker containers under docker-compose for two reasons: easy upgrades, and the ability to shove all data into a single directory.

caddy for HTTPS.

Sonatype Nexus for package repositories (caching upstream npm, nuget, maven, composer, and our own internal repos). This one is pretty heavy and will probably be moved to a separate machine at some point.

Never had any issues with any of these in about three years we've been running this setup.

I was a bit facetious with the "zero downtime" statement. It has about 30 seconds of downtime per month, but it is always planned and outside of working hours deep into the night.

There's been zero seconds of unplanned downtime.

Basically you do:

    $ docker-compose pull
    $ docker-compose up -d
and half a minute later it is running new versions. If anything is broken (it hasn't been yet), call the sysadmin guy, and a few minutes later he restores a VM snapshot.

Backups are done at the hypervisor level. I don't know much about that part.


cool, thank you!


Gitea is absolutely production ready, we use an older version by now and don't even think about it much because it just works so well. Exactly what you want in production (production = software should work more often).

Actually not "vendoring" dependencies from github is very much not production ready in many cases.


IMHO storage is the big issue to figure out with a production-ready gitea installation. Gitea only supports the local filesystem for storing bare repo data and not anything like S3 or other distributed or highly available storage systems (it supports S3 only for git LFS data right now; getting the bare repos there too is a long-awaited open issue).

So the big issue for any non-trivial or highly available setup is going to be how you get high availability for the local storage volume. There are tons of options with various tradeoffs and levels of complexity here--simple local disk RAID, distributed filesystems like gluster or ceph, etc. I think this is the real crux of getting a good gitea instance going.

No matter what figure out and test a good backup and recovery solution!


Ah interesting! I read somewhere that it uses the database for that, but I think they were mistaken.

Thanks that's quite helpful. If we do it in prod we'll have to accept single-instance.


Yeah, I forgot to mention that a lot of complexity also goes away if you host it on a good IaaS platform. On AWS, for example, you could just mount an EBS volume as the bare repo root and use RDS for the database. Then you just have to deploy and manage an EC2 instance to host gitea's main process. It's more expensive to use those managed platforms, but it might be well worth the cost vs. doing it all yourself with a bare-metal deployment of everything.


10^999, really?

Rule of thumb is that 10^100 of anything doesn't fit in the observable universe.


Git is distributed, you should be able to completely detach from your remote and still work effectively. I think a lot of big projects get too lazy about good, reproducible build environments and just rely on centralized CI to do everything.


If you haven't noticed, the whole point of GitHub is to infest git-using environments with so many centralized features that users get trapped using that service. Seems like issues and actions are doing that job nicely!


GitHub isn't just a repo in many cases. It is project management software.


This is getting to be too much. I wonder what effect this will have on GitHub's reputation. If I am blocked by GitHub every few days and can't operate properly, I will start looking for a more reliable alternative. And I am just a solo dev who pays $4; what happens to companies for whom time is worth a lot of money? Pretty disappointing.


How do you as a solo developer get blocked on anything related to GitHub? Are you doing all project planning on GitHub without any other copies elsewhere? Most if not everything you do on GitHub could be replicated locally, one way or another.

I can understand the problem for larger development teams, as a lot of communication and workflows can happen via GitHub Pull Requests or similar.


Personally, I can't deploy if GitHub is down. CI pulls from GitHub. One of my projects includes libraries as submodules, with GitHub as the origin. The deployment configuration has everything cloning from GitHub.


> Most if not everything you do on GitHub could be replicated locally, one way or another

That's not really the point though. If you have services in GitHub etc., why would you replicate all of them locally too, just for unplanned downtime?


This is quite similar to the somewhat philosophical question of having backups: if your job is to make sure your company has backups of their data, and you entrust a single vendor with that data because they take care of backups so you don't have to, are you really doing your job? Said another way: if you (or your immediate team) are not the party that turns one thing into two things, then there is still a SPOF in your wheelhouse, and the whole point is for that to not be the case.


SPOFs are something you have to make trade-offs for and decide what you can cope with. How much work is it to maintain a parallel infrastructure _just in case_ GitHub goes down for more than an hour once or twice a month? How many people are actually in a position to pivot their entire company's processes around this failure case?

It can't be "remove all SPOF or you aren't doing your job". How many offices are resilient to continue working if the building power goes out?


Maintaining an entire secondary deployment and CI system just seems unrealistic to me. I'd get so much flak about the extra spend, and when the systems get out of sync and cause downtime, that investment ends up as a net negative for all involved.


Since most executives don't actually care whether backups work for real, just that they are not liable if the backups fail: yes.


Well, usually you figure out how to run your tests, check code coverage and build your binaries locally first, then figure out how to get it to work in the CI environment, not the other way around. So the replication has already been done, but in the direction of local environment > remote environment.


I don't disagree; local environments really should be able to manage dev and most daily tasks, but not necessarily large integration pipelines, test automation, etc.

I just don't feel the need to defend GitHub here for the reason of : "Well you should be able to manage locally anyway"


I am a solo dev, but I work in a company, and right now I've been pushing something up to test it on GitHub Actions. If it were something different, maybe. But I understand how I might have miscommunicated that.


I'm also in a company, but today am building up something on Github Actions.

Every single time I start doing development on Actions, GitHub goes down or actions start failing. I'm not sure if this is saying something about me, or GitHub.


Try building any Rust project while GitHub is down.


That's only applicable if you use crates directly from GitHub repositories, right? I've been sitting and developing+building a Rust project for the last 10+ hours (except the minutes when I was here arguing on HN) and never hit any issues, but I'm using crates either directly from disk (cloned from GitHub initially) or from crates.io.


I could be wrong, but try a fresh build with a clean CARGO_HOME (or maybe just a cargo update?). It seems to query a lot if you don't have a full cache.


While it's true that Cargo will attempt to check on some things, you can pass flags to tell it to not bother. Specifically, https://doc.rust-lang.org/stable/cargo/commands/cargo.html#o...
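
Concretely, that looks something like this (these are standard Cargo options):

    # While the network (and GitHub) is up, pre-populate the local cache:
    cargo fetch

    # Later builds can then skip network access entirely:
    cargo build --offline
    # or, equivalently, via the environment:
    CARGO_NET_OFFLINE=true cargo build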


Maybe using their CI/CD?


CI/CD should only perform tasks that you could execute on any machine, besides the deployment, which may work differently locally vs. via the CI service. But it should still be possible to deploy without CI; otherwise you set yourself up for being unable to deploy when you really need to.


GitHub Actions is not designed to encourage that, though. It uses syntax that only works on GitHub. Look at their workflow examples and you will see.


Don't you define things in your repository first (usually via a `Makefile`) and then call those things in the CI environment? Or are you building things differently in the CI environment compared to the local environment?

I mean, if you're working on a project that has tests, code coverage and binary builds setup, you usually have a `Makefile` or `package.json` or whatever to run your scripts, and your CI setup just calls those very same scripts but in their environment (sometimes with different arguments/environment variables).

Not sure why it would be different for GitHub Actions. It's certainly how I use it day-to-day.


In my case, at least, GH actions is the only place with all the secrets necessary to deploy my (small) webapp. Sure, I can generate alternative tokens and pull some things out of 1password, but it'd be time consuming. (Also, changing things like JWT secrets is less than ideal.)

There's also just the number of things it checks: jest runs, lint/build, e2e and acceptance tests, 2 docker builds pushed into ghcr, and then ansible to deploy. It's mildly error-prone to do myself, especially the docker and ansible steps, because that's where the secrets come in.

So sure, it CAN be done manually, but the entire point of CI/CD is to do everything consistently, repeatedly, and without the risk of manual error. It took me hours to figure things out the first time. Why would I want to risk doing things manually now?


no that's not how it works at all. The "actions" are proprietary to GitHub and hosted on GitHub. People create custom actions and allow others to reuse them. Everything is hooked in to GitHub via their proprietary yaml config.

> Not sure why it would be different for GitHub Actions.

because vendor lock-in. GitHub doesn't want to make it easy for you to switch.


If you have vendor lockin with GitHub Actions, it's because you chose to do it to yourself. Nothing prevents you from using only the `run` action to run a shell script so that everything that CI does can also be done on your dev machine.

Both my personal projects and my $dayjob repositories have every test, etc triggered via `make test` or `test.sh`, then the GitHub Actions workflow YAML just `run`s it. Secrets also work fine - the makefile / shell script expects them to be defined as env vars, so the developer running them locally just needs to define those env vars regardless of how they obtained the secrets.

https://news.ycombinator.com/item?id=25104253
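
A minimal sketch of that pattern (the script name, make targets, and secret name are illustrative, not from any specific repo):

    #!/bin/sh
    # ci.sh: the only thing the GitHub Actions workflow (or any other CI) runs.
    # Locally: DEPLOY_TOKEN=... ./ci.sh   In CI: expose the same var from the secret store.
    set -eu

    : "${DEPLOY_TOKEN:?DEPLOY_TOKEN must be set in the environment}"

    make test
    make build
    make deploy    # reads DEPLOY_TOKEN from the environment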


thanks for the downvotes.

> it's because you chose to do it to yourself.

I didn't choose shit. My company did. Why are you putting this on me?

> Both my personal projects and my $dayjob repositories have every test

Congrats on not actually using GitHub actions? I guess?

So many people here sucking Microsoft cock. And there is yet another incident today! What's that make, three days in a row now? Four, if we're actually counting. They aren't even hitting 2 nines uptime. Two. Fucking. Nines. Going on many years now. But apparently that is just fine because everyone is running self-hosted infra in parallel to their cloud shit.


>Congrats on not actually using GitHub actions? I guess?

>But apparently that is just fine because everyone is running self-hosted infra in parallel to their cloud shit.

In your haste to complain about downvotes and accuse other people of "sucking Microsoft cock", you forgot to actually read the comment you replied to.

Things in the comment you replied to:

1. An assertion that one can write their CI in a script that is not tied to one CI vendor, so that it's easy to run the same steps as what the CI does locally or in another CI. ie, no lock-in to the current CI.

Things not in the comment you replied to:

1. An assertion that I run self-hosted CI in addition to GitHub Actions.

2. An assertion that GitHub has good uptime.


Sure, but then I guess you always install the dependencies from scratch, can't utilize cache, and won't have a dependency graph as well.

Parallelising your build/deployment will likely also be harder to do.


>Sure, but then I guess you always install the dependencies from scratch

Sure? On your dev machines the dependencies are already installed. On GHA VMs the network is fast enough that installing deps is not slow. What's the problem?

And if you really have a problem, presumably you have some master tarball / container image that your devs use to set up their dev machines because installing your deps is so complicated, so scp / pull that in your script?

>can't utilize cache

A cache is necessary for CI VMs that are cleaned for every run. When building locally your dev machine already has everything the cache would have.

>and won't have a dependency graph as well.

>Parallelising your build/deployment will likely also be harder to do.

You realize the two things shell scripts are good at are running commands either in series or in parallel, exactly how you want them? Instead of learning a brand-new DSL to be able to do `if` and `&`, you just write `if` and `&`. This is already covered in the comment I linked.
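
For instance (the step scripts here are hypothetical):

    #!/bin/sh
    # Run independent steps in parallel with plain shell; no CI DSL needed.
    set -e

    ./lint.sh &
    LINT_PID=$!
    ./unit-tests.sh &
    TESTS_PID=$!

    # `wait <pid>` returns that job's exit status, so a failed step fails the build.
    wait "$LINT_PID"
    wait "$TESTS_PID"

    ./build.sh    # runs only after both parallel steps have succeeded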


You make some good points and I tend to agree with your approach. What you have to do, though, is skip using "uses: actions/" and "needs", and you won't be able to use other actions that have been published to the GitHub marketplace. So you have to make a conscious decision to go against the way people usually use GitHub Actions, and you won't be able to utilize some of its features. You avoid vendor lock-in though, which is what we are talking about :)


Right, that's the idea. Not only are these "actions" the cause of lock-ins, you also have to treat them as dependencies that you're trusting with your repo integrity, secrets, etc. As soon as you start using third-party actions you have to start worrying about https://docs.github.com/en/actions/security-guides/security-...

SourceHut is one source code hosting with the right idea here. Its CI only has the equivalent of the `run` command for running shell scripts - https://man.sr.ht/builds.sr.ht/manifest.md#tasks


Yes. I have debated with myself before which route I should take. I went with the GitHub way, just because I felt uncomfortable making a decision like this that goes against all the examples. Perhaps it was the wrong approach. Another problem with the GitHub way is that functions/code reusability are almost non-existent.


Are there any CI/CD metasystems that aren't vendor locked-in? GitHub/even GitLab, Jenkins (freestyle and scripted)/Azure/Travis.... all vendor-specific, as far as I know.

Sure, have all the heavyweight stuff in separate scripts that are just called, but platform specification/multiple platform builds/specifics of caching/secret handling/deployment handling are always different. Some tools (e.g. codecov) do abstract over some platforms, but not all, and the GitHub Actions model of "here is a literally pre-prepared step in your pipeline" can be pretty appealing.

It's literally, pick your poison, and resign yourself to reimplementation if you ever need to switch platforms.


You can run github actions offline

https://github.com/nektos/act
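
Basic usage is roughly the following (the job and secret names are placeholders, and the flags are from memory, so double-check against act's own docs):

    # Run the workflows triggered by the default `push` event locally (uses Docker):
    act

    # Run a single job, supplying a secret the way the hosted runner would:
    act -j build -s DEPLOY_TOKEN=example-value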

I don't think github is trying to create lock-in; rather, I think they were trying to make a way to easily share actions (not sure what other CI systems are designed to have an ecosystem of publicly shared actions). The actions are public, and therefore it's easy to make something that interprets them.

I can only guess at some point there will be a push for CIs to converge on some "actions" standard, maybe?


That's neat, though it is a 3rd party trying to replicate it.

Gitlab's open source runner supports parsing the ci yml and running a job locally, presumably using the same code as the platform, like:

  gitlab-runner exec docker some-job-name
Though both would suffer from any dependencies on platform-hosted environment vars, secret managers, etc.


> But it should still be possible to deploy without CI,

Solo? Yes. In a team? Less often just because you might have credentials stored in secrets that are not shared with the entire team for security reasons.

In larger corps it's easier to wait it out than go through the trouble of escalating permission requests.

But I do get your point. Personally, for deployments I have Ansible playbooks that are invoked via straightforward calls which can be done locally.


Yeah, but theory and practice are very different things. If CI has always worked for me and my experience with it has been great, I will not invest time in making everything independent of CI. Hence depending on GitHub Actions. Clearly not the best approach, but it's also about being pragmatic.


What do you mean "independent CI"? You start with a `Makefile` with the commands for testing, building and so on, and CI is just calling those commands but on another instances, that likely gets triggered for each commit/tag/pull request/whatever.

When you first set up CI, you don't actually develop for the CI environment first; that's the second step, after you know how to perform the same things locally. So I'm not sure why there would be "independent CI for everything".


It's easy enough to unintentionally box yourself into a hard 3rd party dependency. Like the github secrets store, conditional logic or triggers in the ci.yml files, pipeline driven semantic versioning, etc.


It seems the pre-MS GitHub's approach of "we have maybe 1 new feature a year in the pursuit of keeping our duct-taped service afloat" might have been the right call. Nowadays tons of new features are pushed out every week[0].

0: https://github.blog/changelog


I get what you're saying but I cracked up at this "GitHub is releasing too many features and breaking things!" because when I clicked your link the latest entry is:

> Diffing Adobe Photoshop documents is no longer supported


How can a system be stable if you don't at least scrap two features a year?


Someone who decided to self-host their own source control system here; figured I'd comment on some of my experience since some people are mentioning it as an alternative. I've personally run GitLab in my own homelab for a few years, but after some updates went wrong, I eventually moved over to Gitea, Nexus and Drone, also self-hosted: https://blog.kronis.dev/articles/goodbye-gitlab-hello-gitea-...

GitLab is a large and complicated piece of software, which is a double edged sword: you do get an integrated issue tracking solution, Docker Registry with integrated access controls, as well as a pretty decent CI solution, which is great. At the same time, however, there's a chance for one of those subsystems to start misbehaving and bring down the whole thing (e.g. Omnibus install with default configuration for where to store container images --> running out of space; especially relevant if you don't find a good way to clean up old images), which may or may not be harder to solve than having 3 separate integrated systems.

Also, GitLab is pretty resource hungry, and migrating to another solution actually improved the performance of certain systems/parts of the functionality by quite a bit: Gitea is really fast, Drone CI is about on par with GitLab CI, and Nexus is a bit faster than GitLab Registry, though it also eats too much RAM (the same problem exists with both Ruby and Java: Nexus seems to fail to start up with less than approx. 1 GB of RAM, and GitLab needed at least 2 - 3 GB of RAM).

However, running GitLab at work is still a pretty good idea, as long as keeping the instance up and running, as well as updated is someone else's responsibility and you can focus on just being the end user. At the same time, however, the update paths can be problematic sometimes, as was the case with my container based Omnibus installation that was multiple major versions out of date.

But for personal use? The Gitea, Nexus and Drone setup is perfectly sufficient, alongside maybe some issue tracking software (be it Kanboard, OpenProject or something else). In short, to the folks who are considering self-hosting in the comments: you can definitely try it out and have it be a learning experience, maybe on an affordable VPS or a device you have laying around!

Furthermore, there's nothing really preventing you from mirroring your repos over and keeping them in multiple systems, alongside any backups that you might be making!


Github broken. CircleCI broken. I'm going for a coffee.


It's the snow day of the tech world.


Cannot really rely on Github lately. Wondering if they have some sort of post mortem for the incidents of last month. I would like to see why we should trust it to not go down again.


> I would like to see why we should trust it to not go down again.

You shouldn't ever trust anything to never go down again. It's just inevitable that services go down from time to time, you can't change that unless we have a major breakthrough in software engineering where somehow people stop making mistakes.

So instead of figuring out if you should "trust it to go down again or not", assume it will at one point in the future and plan accordingly. This is how you build resilient systems.


As someone who is responsible for CI/CD...

welp.

I'm thankful that at least I'm not on security investigating the severity of the okta hack.


Ah the semi-daily github fuckup.

For this one, I found myself rage-screaming at my monitor as their issue markdown preview API was timing out ~80% of the time.

Github enterprise is probably happening for us this year. I don't give a shit about the new features in the public build. I just want it stable and working at this point.


From [0]

> Until the next time GitHub goes down again (hopefully that won't be in another month's time).

Oh dear. [0] Never expected it to be this soon or that bad. Expected it to be down in a few weeks at least. Last time this happened was just 5 days ago. [1]

Yet I see complaints here of GitHub being unreliable as predicted, and even some people considering eventually self-hosting despite paying $4 a month.

I don't think it makes sense to go 'all in' on GitHub, does it?

[0] https://news.ycombinator.com/item?id=30714123

[1] https://news.ycombinator.com/item?id=30711442


GitHub is a platform that comes with all platform-related risks, i.e. the operational risk of having the whole development lifecycle (SDLC) blocked, vendor lock-in due to using some GitHub-unique features, and the inability to rely on third-party dependencies hosted on GitHub.

There are a couple of strategies one could rely on to mitigate the risk of github downtime:

- create a map of features between github and competitors, find the common denominator, and risk-accept anything that is unique to github but vital for your business/product;

- maintain a primary self-hosted instance of gitea/gitlab/sourcehut and use github as a mirror;

- use github as a primary platform, maintain a mirror on public gitlab and switch to gitlab in case of outage;

- if github is used beyond its primary purpose (git hosting), put some effort into maintaining the same features in gitlab/gitea/sourcehut, i.e. use CI features of both platforms and push releases and artifacts to both of them so that end users have the choice.

- separate concerns and not go all in with github as a platform for the whole SDLC and project management and instead use different tools/platforms for different purposes.

Yes, it takes more time for the initial setup but it quickly pays back.

The above applies to those cases where your products need to be hosted publicly.


Was wondering about multi-remote git workflow and found this: https://jigarius.com/blog/multiple-git-remote-repositories

This gets tougher with CI/CD though. How urgent is it to run your tests or deploy right now as opposed to when the outage is resolved later, or tomorrow? In emergencies I think you need ways to deploy without the CD pipeline (for those that can't currently).

In my experience there's always more work to do and downtime is a temporary inconvenience. Of course... my customers would be blowing up our phones and emails if we were down, threatening to quit, etc. I assume Github doesn't have quite the same level of pain when they have downtime.


Today, I saw a beta timeline feature. Could this incident be related to that release?


Instead of releasing unnecessary social-like stuff they should have focused on their core git service. Seriously, what’s their uptime? Do they even have a number somewhere? Actions must be down every second day or so.


'Be what's next', or how to ruin the brand of a 7.5B acquisition.


Microsoft has big plans for github. Maybe the brand is a bit tarnished by the time they realize those plans, but if they succeed in fully integrating the developer productivity stack into both the github app and site they'll see a very large amount of revenue.


What GitHub app? edit: turns out to be a flavour of vendor lock in, https://docs.github.com/en/developers/apps/getting-started-w...


Visual studio will be rebranded as the github app.


Pretty wild how these outages are basically just tolerated now. Folks just get some coffee, refresh HN a dozen times, and then resume work when it comes back online. No huge uproar, no exodus from GH.


Everyone with projects small enough to kneejerk move around migrated to GitLab when Microsoft bought GitHub


It was a foreseeable inevitability that once SaaS took over from on-prem, the quality would degrade to what we were getting from on-prem.

...but if you said anything you were just interrupting people playing with their new toys.

The shrug whatever coffee break while another department fixes things is the same now as it was back then.


Has anyone made a good SaaS product for self-hosted Github Actions runners? I'd gladly pay for beefier VMs from a 3rd party provider. Actually self-hosting is a pain in the ass because Github gives you only the runner software but none of the containerization/stateless wrapping I'd expect. Without some containers and scripting you'll get permanent changes to your VM's filesystem as you run actions.

And the VMs GitHub provides are far too low spec to run Cypress in. There's no option to pay for more resources.


What surprises me with many of the outages from big companies is that it seems like they have a bad setup when it comes to zone redundancy. The fact is, many of the smaller companies I've worked with are much better at this. Facebook, Fastly, GitHub: how can companies like that fail at such basic concepts? Really mind-boggling. There must be something fundamentally flawed in their development process.


Perhaps they've all gotten big enough to run their own DNS?


I just got a bunch of errors: error: failed to push some refs to 'github.com'. Luckily, after a while the error disappeared.

I was going nuts. For my own products I use gitea and host repos myself. Some clients of mine however use github so in this case I have no choice


Anyone think the warnings issued about Russia's cyber attacks have any correlation here?


I'd personally correlate it with their Azure migration. Either that, or recent launches. Stability aside, they've been shipping a lot of changes recently. And more often than not, changes lead to outages.


Seems likely that migrating to Azure is related to being acquired by Microsoft. I blame the new ownership for the decreased stability.


A migration of that scale is difficult to execute on. I wouldn't be so quick to blame the whole Azure platform, unless it really is something in Azure that's responsible for these outages.


Yeah I could see that... with just them going to Azure alone lol. Azure was very unstable at a previous company I was at.


I would expect any serious Russian cyber attack would more closely resemble "polonium tea" in the sense that it would be: incredibly destructive, quite obviously from a state actor, and yet officially not connectable to Russia.

Strategically cyber attacks are the most effective when they damage your adversary heavily, and yet at the same time make it so that politically retaliation looks like aggression since there is no unquestionable link between the actor and the action.

Consider Stuxnet as an example. Widely understood to be a joint US/Israeli cyber attack on Iran, and at the same time difficult to retaliate against.

While annoying, these outages we've been seeing are hardly doing any serious harm. Very little money is ultimately lost because of them, and it's not remotely clear that this is a cyber attack. So if they were cyber attacks it wouldn't make much sense. Not saying that it's impossible they are, but it seems there are many other more plausible explanations.


No


It would be nice if GitHub started operating the way GitLab does: free Community Edition (open-source) and paid proprietary Enterprise Edition (source-available), with development being completely public (employees push their commits, do code review, and work on issue boards on the official GitLab repo [1], which is publicly available).

[1] https://gitlab.com/gitlab-org/gitlab


My boss threw his hands up and said we are switching to GitLab. I love GitHub; why are you doing this to me, Microsoft!


It seems incidents like this with the big providers are becoming a lot more common. Wonder why that is


With all the recent incidents have they broken their SLAs yet?


A SaaS company never breaks an SLA. SLAs are written by the legal team. Uptime is maintained by the engineering team. They aren't even in the same building.


Yup. After all, this isn't downtime, it's "degraded performance":

> Git Operations is experiencing degraded performance. We are continuing to investigate.


Microsoft SRE always sucks. GitHub is just another victim.


Thanks! Just had a problem.


can't open a PR right now.


went to downdetector first but this is betta


Can we please stop posting major system incidents and status pages for major cloud/SaaS providers? What's the point?

Post mortems - TOTALLY fine. But just linking to a specific incident is IMHO not valuable to HN and just takes unwarranted frontpage space.


I find this valuable.

-Sometimes for conversations about alternatives

-Sometimes for conversations about the product's stability history

-Sometimes just for solidarity in frustration


> -Sometimes for conversations about alternatives

So then submit a post mortem and drive the discussion there. Or alternatively post a "Ask HN" if you're legitimately considering alternatives.

> -Sometimes for conversations about the product's stability history

Which are readily available on those provider's status pages.

> -Sometimes just for solidarity in frustration

Isn't that what Twitter is for? Why should the frontpage be littered with <insert provider here> outages?

My point is this - let's say GH goes down 8x over 2 weeks for 10 minutes each. Should there be 8 posts on the frontpage about GH going down each time? No, there should be ONE, otherwise the valuable discussions you're referring to get lost.


I think it's so people realise it's not just them.


Makes no sense. If you're having issues wouldn't you just go directly to <insert service here>'s status page first? Why HN first?


Because status pages lie.


In many cases the status page of a service is an unreliable indicator (e.g. AWS's) so I find HN / Twitter to be a much better signal that an incident that affects many people is happening.


Status pages are just not reliable unfortunately.


The point is notifying the active community here and discussing the reason(s) and impact of the outage, or in GH's case their regularity I guess


So then write a post that says "Ask HN: What is going on with GH being out X many days?". Otherwise you get six posts across weeks with the same discussion.


I can understand your viewpoint but just to offer a counter I always check HN for major outage news when it comes to github/AWS/GCP/Azure. There have often been comments with bits of information as to the cause and remediation efforts.


> bits of information as to the cause and remediation efforts.

That's what post mortems are for. Happy for people to submit those.

Also, why are cause and remediation efforts important to you? If you rely on the service and it goes out, what more are you going to learn here that would help you? Genuine question.

> I always check HN for major outage news when it comes to github/AWS/GCP/Azure

Why not just check those status pages? Or check Twitter? HN seems like a weird place to check for it...


> Also, why are cause and remediation efforts important to you?

Because I have to provide information to others in my company about why X is not doing Y. These bits of information help.

> If you rely on the service and it goes out, what more are you going to learn here that would help you?

Roughly the same answer - though I'm also just actively curious what's going on as well in the moment.

>Why not just check those status pages?

I do. HN often has more information.

>Or check Twitter? HN seems like a weird place to check for it...

I don't want to.


HN is frequently discussing outages before the status page updates...

(I recognize not in this case, but I certainly noticed this outage and checked HN before the status page was updated)


If there was no virtual public space in which to tar and feather shameful technical practices (or at least review them honestly), they would probably be even more prevalent.

I also see these as an excellent learning opportunity. Typically, it is a lesson in managing complexity.




