Just my personal take, I think this is a really well-written incident postmortem. It's specific, extensive, candid, and dare I say, entertaining?
Many incident reports are entirely lacking in meaningful detail, or wholly unapologetic. I actually enjoyed learning tidbits about the author, in particular their mention of https://how.complexsystems.fail/.
Reading this boosted my confidence in Slack's teams, which should ultimately be the objective of a release like this. It's not pure PR nor a gruff legally-obligated disclosure.
It helps that I wasn't really affected by this incident.
The named authors on the blog post are an amazing set of engineers, exactly the type who would add a lot of introspection and expertise to what happened while maintaining a high level of lightheartedness. I spent a number of years at Slack and was involved in a postmortem in my first week as an Engineering Manager. I was impressed by how excellent practices were borrowed from Etsy in the early days and then magnified by Google's practices. As someone who once had to run these sessions and distill the learnings into the wee hours of the morning, this is great to see.
The fact that the current top comment thread is quibbling about the date format in the title seems to support this assessment: if there were anything real to complain about, that's what we'd be seeing. Instead we get bikeshedding over the date format in the title of a post.
Now a more philosoraptor-style comment: I see Mcrib is a service built to quickly detect and replace memcached instances. I treat memcached in infrastructure as a very stable service, meaning it is infrequently necessary to upgrade it, and it will generally not fail on its own. If it does, the failures will be highly infrequent compared to services with higher churn or more complexity/dependencies. This means that if they're failing often enough that you need to rapidly detect and replace them, you have a more fundamental problem.
From a structural standpoint I think my technical comment can be useful. If things really are failing this much: A) you should figure out why and slow that down; B) if you have a generally stable system and understand the typical rate of failure, you can add tripwires into Mcrib to avoid over-culling services and to loudly raise alarms (a sketch of what I mean is below); then C) you can improve technical reliability with redundancy/extstore/etc.
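To make B) concrete, here's a minimal sketch in Python of the kind of tripwire I mean; the class name and thresholds are entirely hypothetical, and the right numbers depend on your measured baseline:

    import time

    class ReplacementTripwire:
        # Allow automated node replacement only up to the expected baseline
        # failure rate; past that, refuse and raise a loud alarm instead.
        def __init__(self, max_replacements=3, window_seconds=3600):
            self.max_replacements = max_replacements
            self.window_seconds = window_seconds
            self.events = []  # timestamps of recent automated replacements

        def allow_replacement(self):
            now = time.time()
            # Keep only events inside the rolling window.
            self.events = [t for t in self.events if now - t < self.window_seconds]
            if len(self.events) >= self.max_replacements:
                # More failures than the stable baseline predicts means
                # something systemic is wrong: stop culling, page a human.
                return False
            self.events.append(now)
            return True

The exact numbers don't matter; the point is that the limit encodes your understanding of how often the service _should_ fail.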
I've also seen plenty of cases where folks let a dependency of a service determine whether that service is usable, which I disagree with quite strongly. Consul being down on a node should trigger something to consider whether the service is dead, not a conclusion that it is. This matters both for reliability (don't kill perfectly working things, because you end up having to design around it) and for maintainability, as you've now made people afraid of upgrading Consul or other co-dependent services. A similar failure mode is single-point-of-testing availability checking, where you probably want two points of truth before shooting a service instead.
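A minimal sketch of the two-points-of-truth idea, assuming a plain TCP probe is acceptable (the function names are mine, not from any real tooling):

    import socket

    def memcached_alive(host, port=11211, timeout=2.0):
        # Probe the service itself with its own protocol; "version" is a
        # cheap, side-effect-free memcached command.
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"version\r\n")
                return s.recv(128).startswith(b"VERSION")
        except OSError:
            return False

    def service_is_dead(probes):
        # Only declare a service dead when at least two independent probes
        # (say, the direct check above run from two vantage points) ALL fail.
        # A dead sidecar like Consul can page a human, but it never pulls
        # the trigger by itself.
        return len(probes) >= 2 and all(not probe() for probe in probes)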
Now you risk people being afraid of upgrading just about anything, which means they will work around it, abstract it, or needlessly replace it with something they feel safer managing. The latter is at best a waste of time, at worst a time bomb until you find out what conditions the new thing breaks under.
This isn't to say you should stop assuming anything can fail anywhere at any time; I'm just pointing out that how often a service _should_ fail is extremely useful information when designing systems, fail-safes, alerts, monitoring, etc.
"I treat memcached in infrastructure as a very stable service."
I run memcached at a large scale. You are totally right. Every other year we will find ONE bad memcached node down. We use Nutcracker instead of Mcrouter for consistent hashing to each memcached node. Once I read "We also run a control plane for the cache tier, called Mcrib. Mcrib’s role is to generate up-to-date Mcrouter configurations", I was like: oooooh boy, here we go....
Knowing memcache is a rock comes with experience though.
Our underlying hardware (AWS) is nothing like this reliable. We see regular (several times a year) failure of racks of machines or whole DCs.
Across the whole fleet (all services), we lose 1-10 servers per day as a baseline. Major events are then on top of that and can impact thousands of hosts at once.
I don't believe you run it at the scale Slack does.
The people at Slack who decided to use Mcrouter (and created Mcrib) have experience running Memcached, Mcrouter and Nutcracker in production at two of the biggest web properties in the world.
I think you nailed the real issue that caused the incident: treating "consul down == unhealthy memcached" and then evicting the node. If Mcrib instead did some actual applicative health checks (e.g. a memcached ping), correlated with some system metrics (CPU, RAM), it could avoid evicting perfectly good nodes with a warm cache that just happen to have a restarting consul agent.
Granted, this is easy to say once the incident has happened and an excellent postmortem is published, but this should be an industry-wide wake-up call: don't do this.
I have the same issue at work, where people treat a "prometheus node_exporter down" alert as "the app on the machine is down". I've started adding the actual app name to our alerts, and now people don't freak out anymore when they see "down" alerts: oh, node_exporter is down but not the app? Don't panic; calmly check why.
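As a sketch, checking the app itself rather than inferring its health from a sidecar might look something like this (the ports and classification strings are illustrative; a real setup would query Consul's HTTP API rather than just its port):

    import socket

    def tcp_probe(host, port, payload, expect, timeout=2.0):
        # Applicative probe: send one protocol command and check the reply.
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(payload)
                return s.recv(256).startswith(expect)
        except OSError:
            return False

    def tcp_connects(host, port, timeout=2.0):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def classify(host):
        # memcached's "version" command is a cheap liveness check.
        app_up = tcp_probe(host, 11211, b"version\r\n", b"VERSION")
        sidecar_up = tcp_connects(host, 8500)  # illustrative consul agent port
        if app_up and not sidecar_up:
            return "sidecar down, app fine: investigate, do NOT evict the warm cache"
        if not app_up:
            return "app down: eviction candidate"
        return "healthy"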
It’s likely that the memcached install is so large that the underlying instances themselves are failing. When you have hundreds or thousands of instances, failures in the instances themselves become pretty regular.
I don't see this. I have thousands of long-lived instances - full VMs, not containers, running in our hardware.
If they start "going bad", something is wrong. That's a signal I wouldn't want to ignore.
It has happened - once an HBA in a storage node was causing occasional corruption, another time due to a communication failure people were building things with the wrong version of something which had a memory leak and would eventually summon the OOM killer. There have been other issues.
"Have you tried turning it off and back on again" is still a terrible system management strategy.
I can say with certainty this isn't strictly true. The failures should be relatively rare; when I say relatively I mean on the level of natural node failure. If natural node failure isn't survivable without special systems to quickly replace downed nodes, you don't actually have an N+1 redundancy system. Thus, the pools aren't large enough :) Or, in this case, if they really are failing this much, then having them always lose their cache is a major reliability hole.
It's a subtle difference. I think many operators get used to node failures being extremely common when they don't necessarily have to be. I suspect the note about "if they come back on their own, ensure they're flushed" means they have something unusual causing ephemeral failures. If that's just "cloud networking" there isn't much they can do, but it's almost always fixable.
> The failures should be relatively rare; when I say relatively I mean on the level of natural node failure.
And exactly how rare do you believe this to be?
In my experience, node failures at a scale of hundreds to thousands of nodes are monthly to weekly, if not daily. Generally speaking, failures are spread fairly evenly across instance age: young, new instances experience similar failure rates to old instances. If you have any sort of maximum node lifetime (for example, a week) or scale dynamically on a daily basis, then you'll see a lot of failures.
Which still means you could implement a hard limit of one failure per hour and only allow more replacements with manual intervention. With a thousand nodes, having several (let alone hundreds) fail within a few hours is so unlikely that you're probably better off preventing automatic failover in those cases.
But that generally mirrors my experience that automatic failover for stable software tends to cause more issues than it solves. A good (i.e. redundant hardware and software) PostgreSQL server is so unlikely to fail that wrong detection and cascading issues from automatic failover are more likely than its actual benefits.
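To put rough numbers on "so unlikely": assume independent failures and a baseline of about one lost node per day across a thousand-node fleet (both assumptions mine). A Poisson model then says that a burst of simultaneous "failures" is almost certainly a broken detector, not a broken fleet:

    from math import exp, factorial

    lam = 1 / 24  # expected failures per hour at ~1 node/day fleet-wide

    def p_at_least(k, lam):
        # P(X >= k) for X ~ Poisson(lam)
        return 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

    print(p_at_least(2, lam))  # ~8e-4: even two in the same hour is rare
    print(p_at_least(5, lam))  # ~1e-9: five in an hour means the detector,
                               # not the fleet, is what failed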
I think you're looking at it the wrong way. A server is never just postgres or memcached, there's always other stuff running, and it's that other stuff that can cause problems. Like maybe you're patching the fleet and a node fails to come back up, or due to misconfiguration the disk gets full.
I'd argue that stable systems are actually worse for operational stability as you become complacent and comfortable and when shit hits the fan you are unprepared.
Hi! I'd like to offer some hopefully useful information if any Slack folks end
up reading this, or anyone else with a similar infrastructure. I'll start with
some tech and make a separate philosophical comment.
Also caveat: I have no deep view into Slack's infrastructure so anything I say
here may not even be relevant. YMMV.
First, some self-promotion: https://github.com/memcached/memcached/wiki/Proxy
memcached itself now ships router/proxy software. Mcrouter is difficult to manage and unsupported. This proxy is community-developed, more flexible, likely faster, and will support more of memcached's native features. We're currently in a stabilization round making sure it won't eat pets, but all of the basic features have been in for a while. Documentation and example libraries are still needed, and community feedback (or any kind of question/help request) helps speed those up tremendously.
It's not clear to me why memcached is being managed like this; Mcrouter seems to be used only to abstract the configuration from the clients, yet it has a lot of features for redundant pools and so on. Especially with what sounds like globally immutable data and the threat of cascading failures during rolling upgrades, those sound like they would be very helpful here.
If cost or pool sizes are the main reasons why the structure is flat, using
Extstore (https://github.com/memcached/memcached/wiki/Extstore) can likely
help. Even if object value sizes are in the realm of 500 bytes, using flash
storage can still greatly reduce the amount of RAM necessary or reduce the
pool size (granted the network can still keep up) with nearly identical
performance. Extstore makes a lot of tradeoffs (i.e., keeping keys in RAM) to ensure most operations don't actually write to flash or double-read.
Extstore's in use in tons of places and everyone's immediately addicted.
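For reference, enabling it is a one-flag affair: point -o ext_path at a file on flash (syntax per the Extstore wiki page above; the sizes here are made up):

    import subprocess

    # Keys and hot items stay in RAM (-m); colder values overflow to flash.
    subprocess.run([
        "memcached",
        "-m", "4096",                              # RAM budget in MB
        "-o", "ext_path=/mnt/nvme/extstore:256G",  # flash-backed value store
    ])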
Finally, the Meta Protocol (https://github.com/memcached/memcached/wiki/MetaCommands) can help with stampeding herds, keeping DB load from exploding without adding excess network round trips under normal conditions. I've seen lots of workarounds people build, but this protocol extension gives a lot of flexibility you can use to survive degraded states: anti-stampede, serve-stale, better counter semantics, and so on.
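As a sketch of how the meta flags compose for anti-stampede (flag letters are from the MetaCommands wiki; double-check the exact semantics there before relying on this):

    def meta_get_command(key):
        # v: return value, t: return remaining TTL,
        # N30: on a miss, auto-vivify the key with a 30s TTL and grant
        #      exactly one caller the "W" (won) flag,
        # R30: within 30s of expiry, grant one caller an early-recache win
        #      while everyone else keeps getting the still-valid value.
        return f"mg {key} v t N30 R30\r\n".encode()

    def action(response_flags):
        if "W" in response_flags:
            return "recompute and set"    # we hold the single recache token
        if "Z" in response_flags or "X" in response_flags:
            return "serve stale or wait"  # someone else is recomputing
        return "use cached value"

The net effect is that exactly one client goes back to the database while the rest keep serving, which is the anti-stampede behavior described above.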
Sometimes a new rollout causes an outage; sometimes rollouts are delayed due to the overall system architecture. Reading the postmortem, I could not help but be reminded of this issue as described here: https://www.youtube.com/watch?v=y8OnoxKotPQ
> Because the GDM data is immutable and thus can tolerate staleness, the query was also updated to read from replicas as well as Vitess primaries.
Any data that is fronted by the memcached caching tier can tolerate staleness, right? There is not much difference between a short TTL of 1s and an async replication delay of 1s.
Abusing Vitess primaries is the root cause of this incident, and a similar incident could happen even without any scatter query.
(The original McDonald's McRib sandwich is well known for only being sold a limited time.)
So "Mcrouter" comes from Memcache-Router, then the obvious McDonalds jokes are made and someone cleverly suggests "Mcrib" for the next service. But I can't think what the backronym would be for it. Memcache Ring Buffer maybe. Or Broker.
RIB is a common term in networking for "Routing Information Base" (being the set of all routes which could be chosen to be installed in the routing table (or FIB -- "Forwarding") by the control plane. I don't know that this is the actual etymology but it's not implausible.
They had to know when they picked that name. If workers at Slack actually pronounce it like "Ehm See Rib" or, forbid it, "Ehm See Ahr Eye Bee" and not "McRib", I have very little interest in working there.
1.5 hours for a tool like Slack is major. That's lots of productivity lost (or gained, depending on how you view Slack), and thus $ impact at companies that rely heavily on Slack for internal/team comms.
I find reading about these incidents super interesting, and I generally find the work performed by the folks keeping these services running (and dealing with the inevitable falling-over of any computer system) impressive.
At the same time it seems like a horrifying job I would never ever want :D
This is very transparent and a good write-up. I wonder if someone at Slack could explain how they calculate their downtime on their status page. This outage was for 3 hours and 14 minutes but they claim 99.79% uptime for the month of February.
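For what it's worth, the arithmetic doesn't line up if the whole window counts as hard downtime (assuming a 28-day February):

    feb_minutes = 28 * 24 * 60                        # 40320
    outage_minutes = 3 * 60 + 14                      # 194
    print(f"{1 - outage_minutes / feb_minutes:.2%}")  # ~99.52%, not 99.79%
    # 99.79% implies only ~85 counted minutes of downtime, so presumably
    # partial degradation is pro-rated or a narrower impact window is used.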
You're probably right. I've noticed a trend of people fantasizing out loud in the past 3-5 years. It's nearly always cynical / conspiracy-theory fantasies, and they're stated as fact (or near-fact, as GP did), but without any backing information or logic... just pure fantasizing.
But, thinking back, it's not that common. Hopefully just a fad. People expressing their frustration at inequality and fear of corporate dystopia.
> engulfs the worker inside a dead-eyed grunt culture, featuring an endless spree of work-life balance destroyers. It might be great for people who ask for things from others, but for the people who have to actually do the thing being asked of them, Slack is a nightmare world.
I think it depends on the organization and how you use it. In a previous role I would’ve agreed with you. People expected you to reply at all hours, where I am now that isn’t the case.
Tools do not create toxic culture or destroy work-life balance. Organizations do that.
Nah, nothing can save a work culture that uses Slack for anything. I have been on both sides of the relationship in the Slack-based work society. It always ends up yielding the same outcome, whether I am a manager, an individual contributor, or even an outside contractor.
There are always a couple of people in your unit who run the show, who make all the animated GIF posts, who make all the slackbots that don't do anything. They always use Slack to gain undue notoriety within their firm. Slack caters to these people because these people tend to control the purse strings.
It's a terrible thing to do to a business, to force Slack upon it. You cannot blame Slack for making this dang product, but you sure can blame the people who consume and purchase the service without even once thinking about the well-being of the employees.
I can't say I've seen any of what you're describing, despite using Slack since the early days across several organizations. I know that my experience is just that, but you might want to do a little introspection; this sounds very pessimistic and a bit paranoid.
I am confused by the query that had the problem. Specifically, I am confused by why the sharding is done by user id.
Even the largest Slack instance probably has under 100,000 users and fewer than 1,000 peak messages per second. That feels to me like it could be served by a single master DB, and sharding by instance feels like it would be the better approach.
The major downside is the big variation in shard sizes, so some management/migration might be needed, but it seems doable to me.
Certainly it feels, naively, that scaling should be easy given the way Slack instances are independent (unlike, say, Twitter).
Thanks! Indeed, the variability of size and usage of each instance was the big issue, and then doing features that crossed instances meant they'd be crossing shards whatever they did, so it made sense to fix the variability issue.
(I'm also surprised/reminded how fast Slack grew and how quickly it became effectively ubiquitous — I think every company I've contracted for in the last five years has used Slack).
I love the diagrams of the cache<->DB cycle in normal vs. degenerate states. Those illustrate the problem very clearly and succinctly, and I hope they make it into a textbook some day. Kudos.
“Mcrib is objectively a better system for generating memcached configurations — but its efficiency made the broader system behave in a less safe way.” Be good but not _that_ good :)
Great postmortem, I love reading these. It's pretty neat that the tech industry is relatively transparent about these situations -- we all benefit from learning about them.
That date format is actually the worst I have ever encountered. m-d-y, with year in 2 digits, numbers not zero-padded, US "order" yet using dashes. It's like a moderator of /r/ISO8601 came up with the worst possible format on purpose. Am I missing something?
Came here to complain specifically about this. 2022-02-22 is unambiguous, big endian, and sorts nicely. IDK why society still uses any other date formats considering how international everything is.
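"Sorts nicely" means plain lexicographic string order is already chronological order, so nothing ever needs to parse the date:

    from datetime import date

    dates = ["2022-02-22", "2021-12-31", "2022-01-05"]
    assert sorted(dates) == ["2021-12-31", "2022-01-05", "2022-02-22"]
    print(date(2022, 2, 22).isoformat())  # '2022-02-22' is ISO 8601 by default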
It's because people have, for hundreds of years, been saying "March second, nineteen sixty-two", which they then write out in that order. As a programmer I find people's frustrations understandable, but you're a bit naïve if you think even a percentage point of the English-speaking population of the world knows or is concerned with big-endianness or sortability. They speak English, at least in America, in that order, and that's the way they write it. Europeans only got it a little better.
I'm an Australian who occasionally has video chats with Americans overseas for work and regularly plays D&D online at least once a week with friends all over the world.
The only time I've ever heard someone say "<Month> <Ordinal>" or "<Month> the <Ordinal>" is when talking with Americans.
Every other time it's always "<Ordinal> of <Month>" or just "<Ordinal>" for short.
There is a reasonable argument for little endian dates (as in the least significant information is usually the most relevant as it changes most often), but apart from the "it has been like this forever" I don't see any reasonable argument for middle endian date formats. Then again, the US is notoriously resistant to the metric system too.
So, European in the US, here. I switch my dates stubbornly to DD-MM-YYYY, 'cause that's the only way. Of course I would. But then there's so many US applications that don't adhere to my settings and use MM-DD-YYYY. So then I am still deciphering 05-07-2020-kind of stuff. All. The. Freakin'. Time.
I sometimes format dates in documents or emails as dd-MMM-yyyy when the audience is international, i.e. 2-Feb-2022. Using the short month form disambiguates the fields and, I think, avoids mental gymnastics for the reader like "what month is 09 again?" (or, in my case, the finger-counting....)
I frequently receive date data in Excel spreadsheets from the UK, but as a US user of Excel, I cannot convince it to interpret the date correctly. It is astonishingly bad at this.
I agree that the European format is probably not more useful, and you probably convinced me to go change my settings to YYYY-MM-DD. But I _do_ think that the European format makes more _sense_, it being in chronological "magnitude" order.
I can understand that perspective, although I maintain the usage is only preferred because of familiarity.
Since you are already on the fence with ISO8601 I invite you to consider time of day. Would you use second:minute:hour? That is also in (reverse!) “chronological magnitude” order.
It's because it matches the way we speak dates aloud. When intended for human consumption, sortability and big-endianness don't matter, but matching the way we speak does. Maybe other cultures actually speak dates differently, I don't know, but I have never seen a native English speaker habitually speak dates any differently than "January 1st, 2001".
All that said, I definitely agree with the original complaint, m-dd-yy is an atrocious format. If you're going to use dashes, stick with yyyy-mm-dd. Replacing the dashes with slashes, as in 2/22/22, would have been fine.
In the UK I think "1st of January" is probably slightly more common than "January the 1st" although you hear both. "January 1st" (no "the") sounds American.
Given that so many (all?) other English-speaking nations, including the UK, usually speak it the other way around and write it day-month-year, I wonder if writing (I'm thinking especially of newspapers here) influenced the way you speak it and not the other way around: "March 1st" saves space and ink over "1st of March", or some other rationale. Someone has certainly already investigated the origin of putting the month first?
Edit: [1] says my hypothesis is most likely wrong, and that the UK just changed later to match the rest of Europe. So maybe that influenced their way of speaking? In any case, matching the way one speaks doesn't seem to be a strong reason, as it's easily adaptable and month names are unambiguous. Interestingly, it also says that using a purely numeric format is incorrect in any formal use, so as not to confuse month and day.
“the twenty-sixth of April” would be the way I say today’s date and anecdotally is in common usage in both countries I’ve lived in (the UK and Australia, both using d/m/y). I’d say it’s about as frequent as “April the twenty-sixth” by itself, and definitely more common if you include the day (“Tuesday, the twenty-sixth of April”).
Oh, this is a great point! I'd never realized that. I know that in Spanish (and I assume many of the Romance languages) we always say the day first, e.g. "dos de febrero" (2nd of February). In American English, even though day-first is technically grammatically correct, we pretty much never say it in that order (February 2nd instead of the 2nd of February).
If you want to write the date little endian then you should do the same with the year. So today’s little-endian date is 26-04-2220. Or maybe that is 62-40-2220? Or is it 62-40-2202?
ISO8601 is the only sane date format. Anything else is only favored for familiarity.
Our Independence Day is probably a special case. Clearly language is flexible enough to say all the formats, but the date format we write matches the most common verbalization.
I write it in that order some of the time… But when I do, I spell out the month because when I say “I was born on June 14th, 1962,” I don’t say I was born on 6/14/62. I also never say 14/6/62. In fact, I almost never say a month’s number in conversation.
If you want to write it out the way it's spoken, write it out the way it's spoken. Mixing the computer's numbers with the spoken word's grammar makes for misunderstandings, and as a programmer, eliminating misunderstandings is one of my goals.
Same, but with spaces, and the month always fully capitalized. I learned this habit in the military as an alternative to 20220222. "22 FEB 2022" is nice because it's not a string of numbers, which is very intimidating to read when written out to include hours, minutes, and seconds, like 20220222122222. It also completely bypasses the argument around month and day, because the format spells the month out.
If I'm writing a letter or addressing a specific thing in a formal context, I choose to "revert" to Month, Day, Year because it's the social standard for the country I am in, and I want to fit that cultural expectation. But for a business document or normal chatter, I think DD MMM YYYY is probably the clearest to both English and non-English speakers. It eliminates the distractions I'd normally be dealing with when considering whether I'm talking to someone out of country or not. It would be really great if it ended up being more widely adopted.
The one exception I can think of is a bug in the mssql datetime type (but not date or datetime2) where strings in that format are assumed to be yyyy-dd-mm if the locale dateformat is dmy (e.g. British English).
I'm so tired of having to do that game every time I see a date. It is not hard, but it is quite annoying. Especially since it isn't solvable in a lot of cases, so you try to reason your way to the most realistic interpretation.
The complaint isn't about the particular other order, but the fact that the order is ambiguous. In this case that doesn't matter, but often it does.
Americans memorize inches and yards, and often also memorize centimeters and meters, and working with either is fine, but we're not so often faced with numbers where it might be inches or centimeters and we have to figure out which (and when we are, it's sometimes a pain - certainly a bigger pain that working with known units).
Or, working with your language analogy, please go fetch me some "pasta" without knowing whether I'm speaking Italian or Polish.
… all the major predominantly English-speaking countries mostly use hyphens in the dd-mm-yyyy format. So although there is ambiguity, it's easily resolved by picking that as the default mentally and only backtracking on failure.
Now, in the more general case, this whole thing feels like a lieutenant/leftenant situation. We are annoyed simply because it's not the way we do things in one peculiar case, when otherwise the language is fully intelligible.
Even people in non-English-speaking countries write in English all the time, especially on the Internet.
Picking one default and back-tracking on failure really isn't that comforting nor the constant reminder that the date you thought it was might be something else.
Since the text itself doesn't clarify, context is the only way of resolving any of the scenarios. In each case it's usually sufficient and often not all that hard. But it's always harder than if the system in use was made explicit, and I understand the complaint (even if my annoyance at the ambiguity is quite significantly below the level where I would have complained myself, particularly in this case).
> Think about it like speaking a different language
The correct analogy is that I don't know which language is being spoken, and the same words are used in multiple languages with different meanings. I can apply heuristics to figure it out, but in some cases I can only guess.
Is there a different post where they used this date format?
None of their other incident reports even have a date in the title. Yet this one does, and in a weird format. Maybe there's something novel about the date, and it's written this way to emphasize the novelty, not to provide some vital information that happens to be excluded from every other incident report title they've posted.
It's a waste of effort and makes me wonder about the competence of the person who wrote it (when looking at a mangled date generally). Display dates are for humans, so write the month name; then it doesn't matter what the order is.
22 Feb 22
Feb 22 22 (weird but still better)
22 22 Feb (very weird but still better)
This also goes to show that 2022 is a better choice. My own personal preference - the 22nd of February 2022.
OMG - I thought I clicked on the tablet thread regarding Sumerian OOOs -- and I thought you were sarcastically making fun of the way the Sumerians captured dates on limestone tablets ~4,000 years ago...
(I had scrolled straight down, so the thread title wasn't visible when I was reading your comment)
This is what you sometimes see for best-before dates in Canada. Even better, because our dates are “supposed to” be like 22/2 but I don’t think anyone here does that, except Quebec perhaps. Sometimes you just have no clue
I know it's somewhat trivial but it does bother me a bit too, because I am used to the dashes being an indicator that ISO 8601 is being used. If you're going to use a nonstandard format, I'd much prefer it not look like a standard one.
This. The use of dashes here is a bit annoying IMO ...
Not sure if this is standard but I usually see the delimiter being used to define the date format: big-endian y-m-d uses dashes, middle-endian m/d/y uses slashes, and d.m.y little-endian uses dots.
The issue is that a great deal of the rest of the world doesn't do this, so you need to decide whether to apply best-guess heuristics to parse it or decide that it's a typo ("ah, there aren't 22 months, so maybe it's the 22nd of February, or someone fat-fingered the 2nd of February...?").
In this case you can look up Slack outages to disambiguate it, but the frustration here, and I share it, is directed at the stubborn refusal to use a standard format that the rest of the world has agreed upon.
Yes, the numbers are all the same, and the author is based in the US, and thus is using the default US format. So it's odd that this is the top comment.
I agree with you though, the point of a date like yyyy-mm-dd is to avoid working out stuff like this. You don't pick a date format based on whether the current date is ambiguous or not.
If you're trying to write a date parser for their outage report titles, the problem isn't the format of this date. It's that this is their only outage report with a date for the title.
It's the title because it's a novel date, and formatted this way to emphasize the novelty. It's also a date for which your question is irrelevant: it's the same either way.
The official EU rules say 22.02.2022, but nobody in Europe would have trouble parsing 22/2/22 or any variation thereof. And the / (or -) separator is indeed used in parts of the EU.
It’s the ordering that’s significant, not the separator.
Yeah, I guess if people look at it and parse it, they understand it. But what I notice is that the IT bubble I am in has no trouble parsing these dates, because we use them a lot in IT, even in Europe. People outside of IT, however, do seem confused by US-formatted dates from time to time, because they rarely encounter them.
More than once have I noticed someone struggling to fill in their birthday on an online form because the people making the form decided to use M/D/Y instead of D.M.Y.
Of course most can help themselves, but it's not like these date formats seem natural or normal to everyone in Europe.
The separator is often a good clue. Dots and dashes strongly imply d-m-y, slashes imply an English date, which might be m/d/y if it is from North America.
A mixture is even more likely to be d-m-y, today is 27/4-2022 in Danish handwriting.
Doesn't anyone learn anything? Youth is not an achievement; experience is. Having to refer to a book, rather than good change management practices, highlights the madness of agile tosh and ignorance of capacity and performance management. It's also a data breach incident (denial of service, unavailability); I hope they reported this to the UK ICO and each country's data protection regulator. Amateurish rubbish.
1: Yup.
2: It's the legal definition under the GDPR (a European law that applies to the US via Privacy Shield). If you can't get to your data - you get the idea?
> The notification obligations under the GDPR are only triggered when there is a breach of personal data which is likely to result in a risk to the rights and freedoms of individuals.
> including the author — which certainly made my role as Incident Commander more challenging!
As if no other way to communicate exists?
I remember using Slack, feeling fed up with emails, until I realized that if I wanted to sync Slack messages offline and have a standard way to view these messages that I was SOL. I am so glad that I've returned to email and optimized my workflow to use email effectively and efficiently. The best part is no more vendor lock-in.