Obviously this sucks and I feel bad for all of those affected, especially the people fixing this or who depended on Slack for their workflow.
But it's hard to overstate the dangers of over-centralisation like this, and I say it as a person who uses Slack professionally.
Maybe running your own Zulip instance isn't as sexy or doesn't have the same integrations, but at least you can have a person responsible for fixing it and get status updates, and ultimately: as much as Ops is a dirty word, being able to plan your downtimes can help a _lot_.
> I certainly remember downtimes, and netsplits bad enough that it would have made team collaboration impossible depending on who was connected to what.
But it certainly wasn't the entire IRC network(s). Which is the parent's point. Slack is centralized.
It was a common practice back then to have one "server". You would be hard-pressed to find mentions of "high availability" as servers were a very scarce resource. Small networks sometimes repurposed existing machines that were serving other workloads and added the IRC daemon.
From my perspective as a business, I don't see a difference between my server being down and all of IRC being down. Either way my comms are down.
All I care about as a business is the frequency and duration of those outages. If Slack outdoes whatever my in-house solution is, Slack wins on reliability.
You can make arguments about the net impact on society of everything being down, but from my perspective as an individual business, having everything down is a win as well.
If I explain to a client that we have a delay because our in-house service is down -- fault falls on our business. If I explain to a client it is because Slack is down, odds are they'll blame Slack, not us. This is especially true if they're impacted by the Slack outage as well or the outage makes news outlets.
Outages from services like Slack, AWS, etc. that are widespread are almost treated like acts of god. Clients seem to be far more forgiving of those sorts of issues, in my experience.
> If I explain to a client that we have a delay because our in-house service is down -- fault falls on our business.
This is an under-appreciated competitive advantage "cloud" services offer - outsourcing blame. Somewhat similar to using staffing agencies as lawsuit shields.
No I don't think this argument holds water. Whose decision was it to use the centralized cloud service? Ultimately it's on you if your business can't respond to the customer.
If the decision appears reasonable in the context of your industry, you won't be blamed.
Phrases like "Nobody ever got fired for buying Cisco/IBM/etc." are tossed around fairly often. If you chose the expensive, gold standard for your use-case and got burned anyway, odds are it won't be held against you.
Of course there are exceptions to that. If AWS going down results in loss of life -- you should take measures to avoid it. If the stakes are a day of lost work, the cure might be worse than the disease from a cost perspective.
It's not about whether it makes perfect sense, it's about how understanding your users are. People treat large cloud service outages like storms. When AWS has issues people aren't mad at me any more than they would be if they couldn't reach my brick-and-mortar store because of a flood.
Will Slack actually outdo your in house solution though?
If you have a service (Slack alternative or otherwise) locally hosted, you're cutting out a whole bunch of failure modes:
1. Upgrades/maintenance at inopportune times (you can control this, or not even upgrade at all if you don't need it)
2. Networking issues, if you can colocate with upstream or downstream dependencies
3. Many scaling issues (disk fills up, etc) since you're only serving your own needs and you're not trying to put as many users in as little capacity as possible
4. Relatedly, abusive use by other users won't affect you.
5. You can keep the service off the public internet and avoid things like DDOS attacks
It's nice to be able to blame someone else, I get it, but that seems kind of gross, and actually being able to provide the best service overall is what I'd go for.
If your service is off the public Internet, you now have a failure mode where the company VPN going down kills all productivity. And if you're going to have issues with a disk getting filled up, that's more likely to happen with an on-premise solution than with a hosted solution where they have a team to make sure the disks don't "fill up". And not keeping your software updated because "an update would happen at an inappropriate time and we don't need it" sounds like a good vulnerability vector.
Happy to do the update at a quiet time (say 3pm on a Wednesday before we go to the pub), but wouldn't do it during a major event (say the middle of the World Cup final when we're monitoring things very closely).
> If I explain to a client it is because Slack is down, odds are they'll blame Slack, not us. This is especially true if they're impacted by the Slack outage as well or the outage makes news outlets.
My perspective is skewed b/c I work at a place that invests 8-9 figures in their supplier risk management program, but in that domain you don't get any breaks here. This dependency is discovered during pre-contract due diligence and your plan and test history to continue operations in the event of a vendor outage is assessed in the overall risk profile of your company.
It's obvious this type of consideration in risk management is picking up steam and I'm expecting an industry to start forming around it. I'm also waiting for 'chaos engineering' features to start creeping into SaaS products.
Why does it matter to me if all of slack goes down at once?
I only interact with "my" slack. There is no material difference to my team's experience if "all of slack" goes down vs "our standalone chat server" goes down.
The only relevant issue is whether or not Slack's team creates a more reliable service than my poorly maintained homegrown solution.
I think that's a fair conclusion honestly, though the cynic in me does want to point out that people do integrate their work against slack and that is difficult to maintain in parallel.
Netsplits are mostly a solved problem these days, both with the increased reliability of the internet and with network failure proxies like RobustIRC (for the linking protocol); and I fondly remember the times when we had a single IRC instance for our whole company (which had people working across Thailand, the USA and Northern Europe) without issue.
But, yeah, I think what you said is completely fair.
I mean I've seen weirder. For Slack specifically I had a fight with a dev team because their deploy tool failed if it couldn't push a message to Slack.
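The usual fix there is to treat the chat notification as best-effort so the deploy never depends on Slack being up. A minimal Python sketch of that pattern -- the webhook URL and the deploy() wrapper are placeholders, not anyone's actual tooling:

```python
# Best-effort Slack notification: a failure to notify must never fail the deploy.
# The webhook URL below is a placeholder.
import json
import urllib.request


def notify_slack(text: str, webhook_url: str) -> bool:
    """Post a message to a Slack incoming webhook; return False instead of raising."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5):
            return True
    except Exception as exc:  # Slack down, DNS broken, proxy issues, ...
        print(f"warning: could not notify Slack: {exc}")
        return False


def deploy() -> None:
    # ... the actual deploy steps go here ...
    notify_slack("deploy finished", "https://hooks.slack.com/services/EXAMPLE")
```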
> at least you can have a person responsible for fixing it
At what cost? Even with the occasional outages, Slack is still more cost effective than having your own person (or team for larger organizations) handle your communication software.
Not only that, but there's no guarantee that a local ops team would actually be any more successful, on average. I'm begging our IT folks to give up on hosting Jira ourselves because they can't make it work well. Random outages, misbehavior, and at the best of times it is sloooooooow. Just use the hosted version and let someone else scale the infrastructure to support it.
(or better yet, don't use Atlassian, but that argument's a non-starter at my company).
We have internal Jira, it's fine. It's Jira, sure -- I personally don't like the keyboard shortcuts (start typing when I hadn't clicked the box and it "does things"), and I'm sure some people set up horrendous workflows -- but it's fast and secure, upgrades are telegraphed well in advance, and it's far more reliable than, say, our SSO system.
Give your IT folks some slack. (hah) We use the hosted Jira and it is slow and has random misbehaviours too. Although I can't remember any complete outages.
I've never used the hosted version; my product manager has, and he swears it's lightning fast. Maybe that's just perspective, though, since the version we host ourselves often takes 60 seconds to pull up an issue (and at its best, it's still 10-15 seconds). Even a slow SaaS could be considered lightning quick compared to that.
I guess that depends? The company I mentioned in another thread had a single IRC node that basically never died.
For 8 years (the life of the company) that machine sat by and did its job without problems.
Was it patched? No, and that's annoying, but that cost literally _nothing_ to the company (except a sip of power).
Zulip isn't bad, it definitely doesn't need a whole person to manage it, probably 1hr/mo is good enough.
How much you pay for an hour (or even 10!) a month of engineering time is almost certainly less than you'd be paying for Slack if you're more than 100 people, especially at their single-sign-on provider pricing.
...and the moderators who resigned immediately set up their own network under their own terms (libera.chat), so I'm not really sure what your point is? The difference is that you can jump to another server with an IRC client, you can't jump to another Slack with the Slack client.
Trying to figure out where a github outage actually impacted us in any meaningful long-term manner in the past five years. Maybe we couldn't ship code for a couple of hours, but all those PRs did still get out eventually, in a timely enough fashion? I suppose there might be some people who had a service outage at the same time and whose restrictive pipelines could only take tweaks via github, which would suck, but I'm guessing for a majority of people it's not worth hosting and maintaining all this infra yourself.
Github is really great for discoverability and the open source ecosystem, but I agree that depending on it for issues, pull requests, releases and the like is a bit sad for decentralization.
I don't really know the solution. I know Fossil [1] has bug tracking, a wiki, a forum and now even a chat, but I've never used it and use Github instead. Maybe adding issues and things like that to git itself would be a solution? I don't know what percentage of people use git the "Github" way (with at least issues integrated into it), but if it's really high maybe it's time for git to follow usage.
Honest question: how often do you discover a project on GitHub? For me the answer is never. It's not like it has an interface for searching or browsing in any meaningful way. Usually you do a web search for some appropriate terms, find the project's main site, which will then have a GitHub link.
There is no "people who imported this package also imported..." type functionality, and I don't think it would be useful if there was.
You can still have multiple origins or email patches if you really need to commit here and now.
At least you can still work when your Git server goes down. And if your Git server is GitHub, you can keep working on your code and not have to go play sysadmin with your Git server.
> That gets access to the repo, but not issues, actions, code reviews, etc etc.
(to the extent that it's external to git, e.g., housed in GitHub rather than the repository), I think we're no longer talking, as your grandparent was, about:
> the most trivially distributed and self-hosted technology ever written ….
I agree that centralization is a real risk, but I don't buy for a minute that your self-hosted Zulip instance is apt to have lower unplanned downtime than Slack.
What practices do you use to ensure higher availability than Slack? Hope is not a strategy.
(Answering my own question, I'd guess that the strategy is "never change anything". To some extent, I don't need 80% of the features that Slack offers, and code that you don't have is code that can't break. Not restarting stuff to release new code is also good for uptime. Would be interesting to see a SaaS that has a super-stable-never-change-anything release track, hosted on infra with similar goals, and see how that uptime compares to people's homegrown version of the same strategy.)
My public-facing servers (well, HTTPS with client certificates to access them) are constantly upgraded. There are two of them; occasionally one goes down for an overnight reboot to get a new kernel, but clients will point to the other one.
RADIUS does VRRP -- not instant, but acceptable.
For true high availability (where even milliseconds of downtime are unacceptable), we have A and B systems (sometimes more). E.g. syslog -- the router sends the messages out of two interfaces to two different servers which don't go down at the same time.
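The same A/B idea works at the application level too; a minimal Python sketch (hostnames are placeholders) that duplicates every log record to two independent syslog servers:

```python
# Duplicate every log record to two independent syslog servers (A and B),
# so losing either one loses nothing. Hostnames below are placeholders.
import logging
import logging.handlers

log = logging.getLogger("app")
log.setLevel(logging.INFO)

for host in ("syslog-a.internal", "syslog-b.internal"):
    # UDP syslog is fire-and-forget: a dead destination won't block the sender.
    log.addHandler(logging.handlers.SysLogHandler(address=(host, 514)))

log.info("service started")  # delivered to both A and B
```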
For local staff wanting access to internal webpages, we can cope with 20-second downtimes at an appropriate time.
I don't run a local irc server, but I'd be very surprised if it was any less stable than any other standard linux box -- do an apt or yum upgrade occasionally, any service outages you plan during a team meeting (thus we don't care if the chat system is down as we're talking via another method).
Stability is far easier when
1) Your server isn't open to the whole world
2) Your server isn't scaling to millions of users
Hell I've got a tcpdump session on one box that's been running since before the last slack outage 3 months ago.
Awful, I'm sure, but it doesn't need to be; all it does is run a lot of SNMP queries on an isolated network, store the data, and serve static webpages via a proxy.
And slack has been breached before. So this is a weird argument to make. The point being made is that you can determine your own trade-offs. With slack: you simply can’t.
Going a year without any sort of outage is hard. For example, if you're just serving from one place, any network outage between the users and that place will be an outage -- seamless failover is pointlessly hard and very few shops do this for their internally-hosted services.
Depends how you define a network outage. One of my transatlantic circuits has been a bit ropey - had a 2 second outage on it a couple of nights ago, does that count as a network outage?
If a user has a local network problem, that's not a loss in the service. If multiple users do that's an issue. My home broadband went out for 3 minutes at 23:33:43 GMT on May 13th, but then I wouldn't run a service needing 24/7 on my home broadband.
My frickin' NAS drive has better uptime than Slack.
Why? My central IRC node at a previous job had better uptime than Slack, and my previous company's Gitlab server had lower unplanned downtime than Github...
The key is that when I do my maintenances (on gitlab, for example) I can take the thing offline to do my work, which makes it much easier than trying to canary a change.
It means I can plan a rollback or keep the backup in place.
It's not that I'd have higher uptime than Slack, really (though I have experienced this); it's that the downtime works for the company, and if it ever did fail we have the knowledge to fix it on our own schedule.
I know that's scary to some, but it's a comfort to others.
> But it's hard to overstate the dangers of over-centralisation like this, and I say it as a person who uses Slack professionally.
I think you're overstating the importance of a chat based application. It's not mission critical to operate a software business or any business for that matter so why is this so "dangerous"?
> Maybe running your own Zulip instance isn't as sexy or doesn't have the same integrations, but at least you can have a person responsible for fixing it and get status updates, and ultimately: as much as Ops is a dirty word, being able to plan your downtimes can help a _lot_.
Let's play devil's advocate and say it IS mission critical. Now you need to pay for full-time staff (read: not cheap), computing resources, and integration development (depending on the solution) for it to be useful AND guarantee uptime.
I don't understand what the big fuss is about Slack vs IRC. Pretending that every business in the world has access to competent sysadmin folks is not only foolish, it's a dangerous assumption.
> I think you're overstating the importance of a chat based application. It's not mission critical to operate a software business or any business for that matter so why is this so "dangerous"?
I feel like the truth is somewhere in the middle here. In an increasingly remote world and with the many ways slack integrates into businesses I think it is incredibly important to some businesses and maybe even mission critical.
But I totally agree with your second point that just because chat is important doesn't make running your own IRC instance a good decision or resource investment.
Actually, the synchronized downtime means that everyone is down at the same time, so that's far preferable to having your own independent downtime. There's no coordination loss because everyone is explicitly off-grid for this period. The common signal is advantageous. And if you're a vendor, your customers are sympathetic if they're also suffering at the same time.
It might be hard, but I think you've overstated the dangers of overcentralizing here. If slack is down for two hours, what's the worst that could happen? Everyone works on something on their own and sends an email or uses one of 50 video chat apps if they need to talk. I think occasional downtime is expected and acceptable.
We've run our own Mattermost server for almost 5 years. It sits alongside our hosting environment and hasn't had any outages, except for maintenance. Some staff moan it's a "poor man's slack" but it works well, keeps everything in our platform, and is virtually free.
Will your self-hosted solution have better or worse downtime than Slack? Surely that’s the better question than whether something like this is centralized or not.
> Maybe running your own Zulip instance isn't as sexy or doesn't have the same integrations, but at least you can have a person responsible for fixing it and get status updates, and ultimately
Let's replace something that is almost never down and doesn't require any maintenance (Slack) with something that requires maintenance and will probably go down more frequently than Slack. I hardly see any benefit in terms of availability in moving to Zulip.
> that requires maintenance and will probably go down more frequently than Slack
Been running Zulip for two years at two companies. Never gone down, not even during maintenance (zero-downtime reloads FTW), which was twice, for version upgrades.
> doesn't require any maintenance (Slack)
Neither does zulip.com, which is a managed service by the makers and maintainers of Zulip and costs the same as Slack.
My previous company had IRC as a backup in case their primary form of communication went down.
You get the benefits of advanced communication systems with the ability to continue to work in an outage, and while it would be nice for everyone to have IRC, only ops folks really need it to sustain operation.
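For the curious, the bar for an ops-only fallback really is low -- a few dozen lines of raw protocol is enough to join a channel. A rough Python sketch against a hypothetical internal IRC daemon (server, nick and channel are placeholders):

```python
# Just enough raw IRC to register, join an ops channel, and say something
# when the primary chat is down. Server/nick/channel are placeholders.
import socket

SERVER, PORT = "irc.internal.example", 6667
NICK, CHANNEL = "ops-fallback", "#ops"

sock = socket.create_connection((SERVER, PORT))
sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())

buffer = b""
while True:
    data = sock.recv(4096)
    if not data:
        break  # server closed the connection
    buffer += data
    *lines, buffer = buffer.split(b"\r\n")
    for line in lines:
        if line.startswith(b"PING"):
            # Answer keepalives or the server will drop us.
            sock.sendall(line.replace(b"PING", b"PONG", 1) + b"\r\n")
        elif b" 001 " in line:
            # 001 = welcome numeric: registration complete, safe to join and talk.
            sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
            sock.sendall(f"PRIVMSG {CHANNEL} :Slack is down, regrouping here.\r\n".encode())
```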
I'm a lowly SRE, and I'm not technical enough to use IRC effectively despite on-and-off attempts spanning more than a decade. Believe it or not, I'm not the least technical person in our org! This, I think, is the biggest reason my company, and presumably many others, don't use IRC. Never mind that running your own IRC or Zulip or whatever doesn't inherently gain you any extra uptime and in all likelihood costs you uptime: who is going to be better able to support your application -- your Ops team with dozens of other critical applications to worry about and no inherent expertise with this one, or the dedicated team with deep expertise and a reporting structure all the way up to the CEO, singularly aligned with the goal of running this application? (This argument tends to apply to any managed-service vs self-host debate. Note that we wouldn't have a debate at all if every Zulip/IRC/etc outage hit the front page of HN. :)
Our data centres have two power feeds, UPS on one of the legs, and backup diesel generators.
Network is resilient via fibres in different directions.
My laptop has a backup UPS built in. If wifi breaks, I can tether off my phone.
Power, ISP, UPS and generators have all failed, but none are as ropey as Slack (at least in the west -- we had power issues in our Kabul office earlier today, although even then Slack has more outages than that).
If we're going to extremes: why not outsource having a CEO? I mean, a company that supplies you with a managing director or chief executive can do so with far more availability than just one guy!
But really, your statement is absurdist; critical infrastructure doesn't have to be outsourced. It's a choice, and there are trade-offs.
Hey, I just want to toss out kudos on your podcast. I found it through someone linking to it in a recent HN story and have really enjoyed catching up on the catalog. Super cool idea and really beneficial.
One website suggestion, add your transcripts to your search index.
Heh, and the last time the podcast published an episode about an Auth0 outage, Auth0 had another outage the next day. Now Slack follows shortly after an episode. How eerie!
I wonder if any companies have experimented with intentional occasional slack 'outages'?
Maybe 1-3pm daily?
I've seen some of the Gitlab Gospels that describe how to not use slack or whatever messaging tool they use in excruciating detail, and it made me think...
...if it requires this much instruction on how _not_ to use it, then maybe there is a tool that is a bit more intuitive.
We teach children how to use their voices all the time (ex: "indoor voice"), rules of social interaction and so on. Beyond a certain point, you need social rules for social creatures.
Shutting down messaging as a cultural break would be fine... if Slack wasn't also the searchable knowledge base of record for nuanced specifications, customizations, and relationships with business partners. When it becomes that, it becomes critical infrastructure!
I recognize the ways in which this is a difficult problem, but it remains frustrating to me that the application isn't capable of providing this information directly to end users. Instead you end up with confusion, missed communication, and occasionally people thinking they've been fired! Surely we can do better.
I am sad to see that our profession hasn't figured out how to make a decent desktop app with local storage (one that continues working, at least in minimal fashion, when the server is out of commission).
Slack, Teams, they are all similar. Many times I come back from a coffee break, open the Teams window, and get a blank page. The software can't even detect that it's showing a blank page and put up a progress ball / beach ball, whatever.
I say this in the middle of attempting a Tauri desktop app. I am sure my app will have the same problems, but my other options are even worse.
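The pattern being asked for is roughly "render from a local cache first, refresh from the network when you can". A toy Python sketch of that idea (fetch_from_server is a hypothetical network call, stubbed out here to simulate an outage):

```python
# Local-first message history: keep the last-seen messages in a small SQLite
# cache so there is something to render when the server is unreachable.
import sqlite3

db = sqlite3.connect("chat_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS messages (ts REAL, channel TEXT, body TEXT)")


def fetch_from_server(channel: str) -> list[dict]:
    raise OSError("server unreachable")  # placeholder for the real API call


def load_messages(channel: str) -> list[dict]:
    try:
        fresh = fetch_from_server(channel)
        db.executemany(
            "INSERT INTO messages VALUES (?, ?, ?)",
            [(m["ts"], channel, m["body"]) for m in fresh],
        )
        db.commit()
        return fresh
    except OSError:
        # Offline: show cached history instead of a blank window.
        rows = db.execute(
            "SELECT ts, body FROM messages WHERE channel = ? ORDER BY ts",
            (channel,),
        ).fetchall()
        return [{"ts": ts, "body": body} for ts, body in rows]


print(load_messages("#general"))  # prints cached history (or []) while offline
```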
A long time ago, the day I finished my probation period at work was also the day of a multi-hour Gmail outage. I got up, couldn't load my email, and had a panic attack for a few minutes before finding out what was going on.
Slack was working for me this morning (desktop and mobile), albeit a bit slow, but once I hit Command + R, it gave me "Server Error something went wrong" too.
It logged me out of my work's workspace from both desktop and mobile device so I can't communicate with my team.
Don't press Command + R/Ctrl + R, if Slack is working for you!
I started using slack about two years ago, and in my opinion they have a quality problem. There are just too many issues that crop up when they roll out change.
It reeks of the "just get it done" anti-pattern where done is change that hasn't had the chaos tracked down and killed.
I bet internally they are waiting for bugs to get reported instead of pro-actively running the changed software under representative load to hunt and kill chaos.
In a corporate environment I'd get absolutely brutalised by lawyers for doing this. The amount of invasive data collection, fingerprinting, phone number requirements, permanent plaintext logging of messages, file scanning, process scanning and upload...
> Slack seems superfluous now that MS Teams is consuming everything before it.
MS Teams sucks so bad from a developer perspective compared to Slack that it's never going to be a proper alternative. I'm not dissing, this is just my experience using both.