
The fact that GitHub has been so unstable for so long is absolutely insane to me. I know ops is hard, but outages this consistent point to an endemic problem. Is it the legacy rails/mysql stack that is the largest culprit or is there systemic rot in the engineering org?



More likely, it's efforts to migrate away from the previously solid Rails stack to MS's preferred stack.

They've had a long history of this kind of stability issue when migrating or trying to migrate acquisitions from their previous stack to an MS one. This happened with Hotmail (Unix server -> Windows server), LinkedIn (custom cloud -> MS cloud) and others since.


Is GitHub moving to .NET and/or SQL Server, or is it "just" moving everything to Azure?


Most of GitHub runs in its own data centers. Services like Actions runners and Codespaces use Azure.


The latter.


The LinkedIn to Azure migration was indefinitely postponed.


Never heard of LinkedIn having problems before.


No need to speculate; GitHub posts fairly detailed information on availability and outage causes: https://github.blog/tag/github-availability-report/


> so unstable for so long

Has it?

I've had hardly any problems. Occasional issues, sure, but rarely am I impacted to the extent that I notice for more than, say, an hour... maybe a couple of times a year.

When I do hit problems with GitHub, my home internet access is more often the actual culprit.


I've experienced a ton of issues, but it's likely because almost every aspect of our operation depends in some way on GitHub -- the repo itself and basic push/pull, or PRs, or webhooks, or Actions, or some aspect of status updates, or random API tasks, etc. We use a lot of GitHub.


Yeah, I can see how the more tied in you are, the more exposure you'd have, especially for very frequent or long-running tasks, etc.

My exposure is increasing, but it's still intermittent, not stuff running all the time, and if something doesn't work ... oh well, it'll run later when it does. So the impact is lower for me than for some.


Why do people think it can be related to Rails when there are tons of companies out there using Rails that aren't affected by this kind of degradation?


> Why do people think it can be related to Rails

Probably because there have been more high-profile stories of companies migrating off of Ruby on Rails to something else (e.g. Java, Go, etc.) than the reverse, migrating onto it.

E.g. the high-profile story of Twitter's earlier "fail whale" scaling problems supposedly being partially solved by switching from Ruby to Java/Scala/JVM: https://www.google.com/search?q=twitter+whale+fail+ruby+rail...

Ruby may be unfairly blamed, but nevertheless the narrative is already out there, even though other big sites like Shopify still use it.


One difference is that running Rails and MySQL at GitHub's scale is rare, even after accounting for the fact that GitHub's scale is rare to begin with.


You mean that most of the other popular huge Rails companies (GitLab, Shopify) use PostgreSQL? Basecamp uses MySQL, though.


GitLab is, AFAIK, nowhere near the scale of GitHub. Shopify I don't know, but I'm fairly certain their type of usage is very different. Basecamp is also at a different scale, and certainly a very different usage/performance profile.


Shopify primarily uses MySQL, unless something major has changed recently. They've done a number of conference talks and engineering blog posts about their usage of MySQL; see e.g. https://www.google.com/search?q=mysql+site%3Ashopify.enginee...


Sometimes it's not Ops. Sometimes it's crap code. ;)

Source: <-- Ops


Clients I've worked with: Our service crashed, why?

Because you designed and implemented it poorly, that's why. Alternatively: How should I know, you wrote it.

If you're ever bored as a developer, switch to operations; you get to be the person developers turn to when they can't code, debug, do logging or security.


I've never really been interested in straight development, though I do enjoy the occasional coding session. I'd say at this point, I'm ops through and through, and I feel that the value I add is understanding the systems and what they're doing at a fairly low level. As such, I do sometimes have to help developers out with things I consider surprisingly basic (not usually code exactly), but that's the nature of teamwork I suppose; I'm happy to have a place in things and don't shy away from epithets like "yaml wrangler" or "helpdesk for devs".


Sadly devops will be one of the first to go as AI progresses.


While on the surface I'd agree with you, in reality I think operations people are going to be around longer than developers at this rate.

It's fairly "easy" and relatively safe to let an AI loose on your Java code base and use it to add new features or find bugs. Very few people would let a similar AI roam around production servers and databases.

If you collect enough logs, exceptions/crash dumps, network traffic and so on, you could feed that to the AI and have it tell you why a service crashed. The majority of my job as an operations person is to figure out why something crashed with only a subset of that information and being able to read the code and reason about why current circumstances resulted in the crash or data corruption. Sometimes the job is even to implement the stuff the developers didn't, while not actually touching the code and relying on what the operating system, database, web server or network tells you.

If developers were better, or had more time and more resources, then yes, an AI could do the job faster and better. In the current environment, operations is pretty safe.


I sure hope so. But realistically AI will thin the devops herd much more rapidly than "proper" development teams. I already use an LLM to crank out my configs, shell scripts, analyse my logs, etc. LLMs are a lot better at these things than at full-fledged development. I think I am now able to do in days what would have taken me weeks. My employer does not need to hire as many devops people as it would have pre-LLMs.

As you can see, you don't need to give AI write access to your production environment.


AI can't swap a drive. AI cannot clear a printer jam. AI cannot replace a spinning rust drive in a laptop with an SSD.

Trust me, there will always be Ops people.

Source: 30+ years in Ops/SysAdmin


DevOps isn't really ops though, right? As in, not product ops. It's ops but for devs, so it's rare for them to have to handle production servers. At least, I hope so. That would be SREs or even sysadmins, right? I'm not up to date on the usage of the term though, haha.


AI is just more software to manage and operate.


Every time there's a GitHub outage of any severity, one of the top comments on HN inevitably suggests that it's probably due to Rails. It's getting pretty tiresome.

Calling it a "legacy rails" stack is incredibly disingenuous as well. It's not like they're running a 5-year-old unsupported version of Rails/MySQL. GitHub runs from the Rails main branch - the newest code they possibly can - and they update several times per month.[^1] They have one of the largest known Rails codebases and are among the biggest contributors to the framework. Outside of maybe 37signals and Shopify, they employ more experts in the framework and in Ruby itself than any other company.

It's far more likely that the issue is elsewhere in their stack. Despite running a Rails monolith, GitHub is still a complex distributed system with many moving parts.

I feel like it's usually configuration changes and infra/platform issues, not code changes, that cause most outages these days. We're all A/B testing, doing canary deployments, and using feature flags to test actual code changes...
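
For what it's worth, the flag-gating being alluded to is roughly this pattern; a minimal sketch with made-up flag names and percentages, not any particular vendor's API:

    require 'zlib'

    # Purely illustrative percentage rollout; real setups (Flipper, LaunchDarkly,
    # a home-grown table) add persistence and an admin UI, but the shape is the same.
    ROLLOUT_PERCENTAGES = { new_merge_queue: 10 } # flag => % of users enabled

    def feature_enabled?(flag, user_id)
      percentage = ROLLOUT_PERCENTAGES.fetch(flag, 0)
      Zlib.crc32("#{flag}:#{user_id}") % 100 < percentage
    end

    puts feature_enabled?(:new_merge_queue, 42) ? "new code path" : "old code path"

The point is that a bad code change can be turned off without a deploy, which is exactly why the remaining outages tend to be config and infra.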

[^1]: https://github.blog/2023-04-06-building-github-with-ruby-and...


It's easier to blame <piece of technology> than to admit that running a service at GitHub's scale is highly complex and takes deep expertise.


It's also that many Rails shops have performance problems, which isn't the same as saying "Rails is slow"! "Getting performance problems at some point" is almost a rite of passage in Rails; I'm certain every Rails developer has pored over N+1 queries, caching, async jobs, race conditions, gems and whatnot to keep the system running.
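
For anyone who hasn't been bitten by it yet, a minimal sketch of the classic N+1 shape (the Post/Author models are made up, nothing GitHub-specific):

    # Hypothetical Post/Author models, purely to show the shape of the problem.
    # N+1: one query for the posts, then one more query per post for its author.
    Post.limit(50).each do |post|
      puts post.author.name
    end

    # Eager loading: a constant number of queries, however many posts come back.
    Post.includes(:author).limit(50).each do |post|
      puts post.author.name
    end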

The only Rails projects I worked on that never had performance problems are the ones that never reached any scale. All Rails projects I worked on that gained traction needed serious refactorings, partial rewrites, tuning and tweaking to keep 'em running. If only to tame the server bills, but most of the time just to keep the servers up. The good news is that it's very doable to tune, tweak and optimize a Rails stack. But the bad news is that every "premature optimization is the root of all evil" project made a lot of choices back in the day that make this necessary optimization hard or even impossible today.

What I'm trying to say is: performance issues with Rails will sound very familiar to anyone who has worked seriously with Rails. So it's not so strange that people reach for this conclusion when almost everyone in the community has some first-hand experience with the problem.


> The only Rails projects I worked on that never had performance problems are the ones that never reached any scale. All Rails projects I worked on that gained traction needed serious refactorings, partial rewrites, tuning and tweaking to keep 'em running.

You'll be hard-pressed to find any stack that doesn't require this.


Obviously.

A big problem with Rails, though, is how easy it makes it to "do the bad thing" (and, in rare cases, how hard it makes it to do the "good" thing). A has_many/belongs_to that crosses bounded domains (adding tight coupling) is a mere one-liner: only discipline and experience prevent that. A quick call to the database from within a view is something not even linters catch; it takes vigilance from reviewers. Reliance on some "external" instance var in a controller (i.e. one set by a module, concern, lib, hook or other method) can be caught by a tightly configured linter, but that too is tough.

Code that introduces poor joins, filters or sorts on unindexed columns, N+1 queries, and the like often looks like a simple, clean setup.

`Organization.top(10).map(&:spending).sum` looks lean and neat, but it hides all sorts of gnarly details across ~three~ four different layers of abstraction: the Ruby language, because "spending" might be an attribute or a method and you won't know which; Rails, because it overloads things like "sort" and "sum" to sometimes operate on loaded data (first actually loading ALL of it) and sometimes in the database as part of the query; the database model, because "spending" might even be a database column and you won't know without looking; and finally the app itself, for how a scope like top(10) is really implemented. For all we know, it might even make 10 HTTP calls.
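
To make that concrete, here's a rough sketch assuming a hypothetical `top` scope and that `spending` really is a column (nothing to do with GitHub's actual schema); the last two lines look almost identical but do very different amounts of work:

    # Hypothetical scope and column, purely illustrative.
    class Organization < ApplicationRecord
      scope :top, ->(n) { order(spending: :desc).limit(n) }
    end

    # Instantiates 10 full ActiveRecord objects, calls #spending on each
    # (which might itself do more work), and sums the results in Ruby:
    Organization.top(10).map(&:spending).sum

    # Fetches only the 10 column values and sums the resulting plain array:
    Organization.top(10).pluck(:spending).sum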

Rails (and Ruby) lack quite a few common tools and safety nets that other frameworks have. And yes, that's a trade-off, because many of these safety nets (like strong, static typing) come at a cost for certain use cases, people or situations.

Edit: I realize there are four layers of abstraction if the all-important and dictating database is taken into account, which in Rails it always is.


I wasn't really blaming Rails per se; if anything, their main database, mysql1, seems to pop up in their post-mortems more than anything else.


> Is it the legacy rails/mysql stack that is the largest culprit or is there systemic rot in the engineering org?

The culprit is change. Infra changes, config changes, new features, system state (OS updates, building new images, rebooting, etc.), even fixing existing bugs: all of these are larger changes to the system than most people think. It's really remarkable at this point that GitHub is as stable as it is; that's a testament to the GitHub team. It's not "rot", it's just a huge system.


I don't think you understand ops :) There's no 100% availability anywhere, so issues and degradation will always happen no matter what: https://sre.google/sre-book/service-level-objectives/
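
Back-of-the-envelope, just to illustrate what less-than-100% means in practice (the 99.95% figure is an arbitrary example, not GitHub's actual SLO):

    # Even an aggressive monthly availability target budgets for real downtime.
    slo               = 0.9995
    minutes_per_month = 30 * 24 * 60          # 43_200
    puts minutes_per_month * (1 - slo)        # ~21.6 minutes of "allowed" downtime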

It's not Rails or MySQL; both have been proven good for years.


Please permit me to indulge in the most extreme example of what you just said.

"What do you mean the database is down after I loaded 500 TiB and indexed all columns? It's MySQL, Facebook uses MySQL a high scale for years without incident!"



