My side project StatusGator monitors status pages (including IBM's ill-fated page) and I'm seeing more than 10% of the nearly 800 services we monitor having an outage right now.
So it appears to affect anyone who depends on IBM Cloud.
I really wonder how people get value out of a meta status page when my experience is that status pages are often incorrect about what the actual status is. Whether they're manually updated, or it's a case of "your 9s are not my 9s", it seems like a compounded broken telephone problem.
Probably it's great to have some very big-picture overview, both in scope ("all" the cloud) and in time ("all" time), and maybe there's even some value in looking at the correlation between these.
Maybe it helps with doing a sanity check before picking a provider. And, I guess, at a basic level it helps with accountability/transparency.
When we started broadcasting status checks we thought it would be a great way to let users know what's happening behind the scenes. Then it turned out a lot of our happenings were self-inflicted, so rather than being less transparent by way of fewer posts, we just tweaked our status outputs: "We are experiencing an event in <SOME VERY HIGH LEVEL SERVICE>" x1042. After we perform our RCA, we may post a blog about it or a short PR blurb.
The last time we went down, I questioned out loud the point of the status page, and the general consensus was that it exists so others can reference our outage.
Thanks and that's a GREAT idea for some detailed analysis. I have been trying to make better use of the 5+ years of status page history stashed in the cupboard.
(You should contact the guy at BackBlaze who does their hard drive stats blog posts, and pay him to do this for you as a side hustle - and release this data analysis quarterly!)
So what are HNers using IBM Cloud for and where do you see that it has an edge over AWS offerings (where an overlap exists, obviously)?
(I figure either you're in devops and too busy putting out fires to read this thread, or you're not and your work is halted because of the incident, so you might have time to read and reply ;)
Price for hardware.
As a base price, their bare-metal gear was significantly cheaper than equivalent-specced AWS gear (if it was even possible to get something like that).
We managed to snag quite a few 'interesting' configurations of things at various times that you just couldn't get at all in AWS. Things like PCI SSDs, very large RAM configs, or High-Frequency low-core count CPUs.
Free international/regional transfer.
We took significant advantage of this to move data around. We'd replicate TBs of data around.
At various times management and dev teams would complain and say that we should move everything to AWS (or whatever cloud provider they'd just met with at a conference).
We consistently showed higher performance and lower cost by significant margins. On cost alone, we were paying a small fraction of what it'd cost on AWS, even after taking into consideration ways to reduce cost on AWS such as scaling, spot instances and reserved-instances.
I really would like to see an AWS memo which maps out common use cases and expected costs (selling points) vs actual use cases and actual costs (pain points).
I don't think there's one good answer for everyone that's going to be right.
I think the biggest issue is that far too many people assume that the AWS Savings stories are universally applicable, and that it's safe to assume AWS is going to be the cheap option.
I'm sure there are folks for whom AWS is the cheap option, but it wasn't at my last job, and it's not for my current one (even though they are using it).
More likely their service offerings are growing so fast that trying to make the pricing structure coherent is like a hyper-aggressive game of whack-a-mole.
We used Softlayer (rebranded to IBM Cloud, and affected by this) at my last job. For the most part, their service pretty much just works; clearly not today. :)
We had a couple thousand bare metal servers, and barely used any of their API stuff.
As with any facility, there were occasional issues with electrical transfer switches, core router failures, fiber cuts, etc. Stuff happens, but we got pretty good communication, and things got resolved in a reasonable amount of time. Service got noticeably worse after IBM, but we were already planning to move to our acquirer's hosting, because that's what happens when you're acquired. Oh, and their load balancers had garbage uptime.
Bandwidth prices used to be pretty reasonable, but they've adopted AWS style obscene pricing. At least they still let you use the private network for free (including to other datacenters).
HN ran on a box at Softlayer until early 2018 or so. This makes me think that the title of this post (which was submitted as "IBM Cloud down as well as their status page which looks to be hosted there") could at some point have been "IBM Cloud down as well as their status page which looks to be hosted there as well as the forum where people post these things which also looks to be hosted there".
I've always imagined it as a big tower shoved under someone's desk. The side panel of the case is off because otherwise it overheats. On the screen there's a single maximized window of DrRacket. A post it note warns you not to quit or reboot the system.
If it was a BGP issue, it's not a problem of where the status page is hosted but instead that you just can't get to it via that name no matter where it's hosted, right?
If by "name" you mean hostname, not really. If you have a domain with multiple nameservers in multiple countries on multiple providers, and your site is similarly globally distributed, at least a few people on the internet will always be able to pull up your site. So at least some of your clients will be able to resolve a domain address from at least some of your nameservers and connect to at least some of your web servers. Geo-IP and Anycast are also really useful here.
edit: It's possible that you could take out an entire TLD and make it impossible to resolve domains on that TLD once all the cached records expire. But that kind of targeted attack would not be possible with a BGP error, unless it was a very specifically crafted BGP error happening over a very long period of time (weeks-months-years depending on the record TTLs).
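As a rough illustration of the "at least some resolution will keep working" point, here's a minimal sketch (assuming dnspython >= 2 is installed; example.com is a stand-in for your own domain) that finds the delegated nameservers and asks each one directly, so you can see which providers are still answering:

    import dns.resolver

    DOMAIN = "example.com"  # stand-in for your own domain

    # Find the domain's delegated nameservers, then query each one directly.
    for ns in dns.resolver.resolve(DOMAIN, "NS"):
        ns_name = str(ns.target)
        try:
            ns_addr = next(str(a) for a in dns.resolver.resolve(ns_name, "A"))
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ns_addr]
            answers = r.resolve(DOMAIN, "A")
            print(f"{ns_name} ({ns_addr}) -> {[str(a) for a in answers]}")
        except Exception as exc:
            print(f"{ns_name}: no answer ({exc})")

If the nameservers live with different providers in different regions, some of these queries keep succeeding even when one provider or one route is gone.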
OK, that makes sense, but you'd still at least have this: if the site is down due to BGP, then so is the status page that's on the same domain.
I guess I'm just calling out the people who are making fun of them for having their status page dependent on the same hardware it's monitoring, when it's not clear that's the case just because they're both down?
I would suppose that if it were on a domain under a different TLD, it would be more reasonable to conclude that.
A status page should be a static site hosted on multiple providers in multiple regions with multiple nameservers. So, Amazon S3 hosted in 2 regions, Azure Storage hosted in 2 different regions, 2 different nameserver providers in 2 different countries using two different backend colo providers. Costs probably <$150/year and that will survive BGP outages, backhaul link outages, hosting provider outages, DNS outages.
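If you go that route, a quick probe across the independently hosted copies keeps you honest about whether each one is actually reachable. A minimal sketch (the mirror URLs are made up for illustration):

    import urllib.request

    # Hypothetical mirrors: two S3 regions, Azure Storage, a second TLD, etc.
    MIRRORS = [
        "https://status-us.example.com/index.html",
        "https://status-eu.example.com/index.html",
        "https://status.example.org/index.html",
    ]

    for url in MIRRORS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(f"{url}: HTTP {resp.status}")
        except OSError as exc:
            print(f"{url}: unreachable ({exc})")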
I'm not going to make fun of them for their status page being down, but it certainly doesn't reflect well on the brand/products.
You guys were one of the best use cases for the SL model, which really hasn't changed in 10+ years. You had very few dependencies on the less-reliable (read: all of them) services inside the SL stack and mostly managed everything on box and in software. In a few POPs you guys were running about 50% of the total SL backbone bandwidth. There were a lot of sad panda hats when you guys started to transition away.
> There were a lot of sad panda hats when you guys started to transition away.
For us as well. It was so nice to have things work one day and the next and the next, although I guess they wouldn't have worked today.
Favorite firefighting moment was when WDC lost half the fiber in ~2014 and we had to move all of our traffic out so that there was capacity. Our guy asked why we had to move, and your guy said something like 'Because if you guys move, we only need one customer to move.' :D
Yeah the move from FreeBSD to Linux wouldn't have been fun for you guys either. And yeah, the WDC POPs were some of the most overbuilt from a bandwidth perspective and that was almost entirely because of you guys. Pretty sure there's a Cisco sales rep enjoying a nice holiday home in Connecticut as a result of the growth you guys did.
I had a high-bandwidth use case that SL filled when I was a kid, but we went through resellers: 10TB/UK2, later 100TB after that was a thing. Then they dropped SL, and every one of SL's products became AWS-priced levels of insane per-GB bandwidth.
The odd thing is, for half the price I could get SL service w/10TB from a reseller, while at list price I only got 1-2TB bandwidth, and sales absolutely would not budge on that. I wonder why.
Last job for me was also a few thousand bare metal servers at SoftLayer. Acquired and moved to that infrastructure instead. Wonder if it's the same acquisition? :-)
FWIW - IBM Cloud today has basically no benefit over AWS, Azure or GCE or even against some of the smaller regional players like AliCloud. The notable exception would be if you need to run a bare metal solution and leverage their free backbone which is a pretty narrow use case these days. The main selling point previously was to stand up your own VMware environment but even that came with a laundry list of caveats (unsupported hardware, limited VLANs, non-flexible IP space) that made it painful to use. Today AWS is vastly more performant, flexible, reliable and has a bunch of useful services you don't get from IBM Cloud.
If we speak specifically about IBM Cloud vs AWS: we use the Natural Language Understanding API in IBM Cloud, and as far as I know the equivalent AWS offering, Comprehend, doesn't provide named entity disambiguation or links to knowledge graphs (IBM links to DBpedia).
Unfortunately, that API changes regularly and often in undocumented ways that cause breakages for customers. It's really a lot of fun to deal with when suddenly a bunch of automation breaks and it turns out an unannounced push fundamentally rewrites foundational API calls.
We have been leasing bare metal servers since the pre-IBM Softlayer days.
Over the past few years we have experienced quite a few network-related outages. Not usually to this extent, more generally a failure of some piece of network gear that takes out either backend or frontend traffic from a particular data center. We seriously priced out a migration to another provider recently, but in the end what held us back was cross-AZ transfer costs on AWS. We found it would raise our operating costs significantly, so the matter was dropped.
We were much happier with the service and support we received prior to the IBM acquisition.
I had originally signed up due to the availability and pricing of bare metal servers and the mixed Windows/Linux server offerings. Their windows server licensing was better than AWS and I didn't want to be on Azure for a variety of reasons.
Currently on them because we have an OpenVPN based infrastructure that is very challenging to migrate.
Lastly the majority of our customers are in the midwest or Texas, and the proximity of their Dallas DC was a huge performance win for us.
Rarely is it just a technical decision; usually money is the reason.
In small and mid-size organizations, the CSP gave better pricing, or they help with your sales, etc.
In large organizations, IBM/Oracle bundle with existing products that are being paid for anyway, account managers have great relationships with decision makers, or the company has already signed big multi-year deals.
This is not just IBM, it applies to GCP/Azure/AWS as well.
I like OpenWhisk which is the basis for their serverless compute offering. Has orchestration/state-machine functionality that makes it superior to GCP Cloud Functions, and uses Docker containers which makes it more flexible than AWS Lambda.
I also really like CouchDB which IBM Cloudant is based on.
Is that enough for me to use IBM Cloud? No, not really.
I suspect nobody really uses it outside of weird outsourced financial modelling/planning tools like TM1 and other apps people stopped wanting to manage themselves.
I work as a consultant with big enterprise companies and I can assure you big enterprise companies are using IBM very heavily. As well as Oracle and HP and other uncool tech companies.
We use Restream.io and SolarWinds Papertrail; both were down today. My guess is they use IBM Cloud itself or some rack space that IBM has rented to other clouds, which is apparently typical at the edge of the major public cloud regions.
All of Broadcastify's audio servers (hosted with Softlayer in their Dallas datacenter) are completely unreachable and down.
I'm going to wait a bit to see if we get a status update, otherwise we'll be spinning up instances on AWS to failover (which will be enormously costly for bandwidth)
So far? Pretty much no news, they're using Slack to communicate a bit. VPN access for everybody is broken or barely working, no internal ticketing as a result. As far as I can tell, private networking is mostly working between servers (at least, for my servers, ~60).
I remember I was at an IBM sponsored hackathon around 2015 where it was a requirement to use Bluemix. Over the course of the weekend the service went down for hours 3 times.
Literally this morning I was wondering whatever happened to it. Like, did it die a quiet death? Oh, it rebranded to IBM Cloud in 2017. Now this news.
I think there's an eponymous law named for this sort of thing.
That's funny. I've had the exact same experience with Bluemix at a Hackathon in the past. It was down for almost the entire weekend, screwing all the teams that didn't pivot early enough.
The timing on that incident was a weird coincidence for our team. We had just rolled out a bunch of production updates, and then 5 minutes later ALL our VMs went down, and I freaked out and tried to diagnose what was happening and recover before I started noticing all the other people going down.
So. Hate to ask this. But what happened to your old posts that are now just "-"'s?
I mean I can guess.
It's just that the communication blackout while the incident was happening was not appreciated. And you stepping forward to let us know what was going on was very appreciated.
- 2020-06-10 02:19 UTC - RESOLVED - The network operations team adjusted routing policies to fix an issue introduced by a 3rd party provider and this resolved the incident
I generally do everything on AWS or GCP, with a little Azure sometimes for personal projects. In what world does IBM beat one of those three in anything? Genuinely curious: how are they able to stay competitive?
Their bare metal cloud offering (SoftLayer acquisition) was actually pretty good whenever I used it about 4 years ago. Wasn’t the most intuitive API or UI but you could get a bare metal server anywhere in the world in a few minutes.
When the wind blows in the right direction. Sometimes your server would get stuck in provisioning for hours and only get "un-stuck" after creating a support ticket. Which, I kid you not, at one of my previous jobs we had automated in our provisioning pipeline. Good times.
Their biggest thing going for them is the 100% free dark fiber private network. You do have to pay for a bigger pipe (100 Mbps is included for each server, with a minimal upcharge for gigabit), but that's pretty much a rounding error.
Honest, slightly cynical question: most probably someone inside the responsible team said at some point that it would be very stupid to host the status page on the same infrastructure being monitored, but they were probably ignored... What should that person do now? Say "toldya!" out loud in the postmortem meeting, or simply shut up and move on, because the reality is that we are hired to do some stupid task and not to think for ourselves?
If such people raised concerns and they had been overridden the way you describe, they would sadly have left long ago.
It is not that companies become consciously malicious or are incompetent to start with; it becomes a vicious cycle: as more and more poor management and engineering talent join, the good ones leave, and the cycle continues.
Acquisitions and mergers stave off the slow slide into irrelevance for a while, till the best of the new guys leave too. Systemic cultural change is very, very hard to achieve in large organizations.
My professional experience tells me that the next question will be who decided for B given Y, then you answer it, and then you have a target on your SRE back, I'm afraid. Remember that trickle-down economics works only when the shit hits the fan, and what trickles down is not money.
I've worked in these types of organizations before, and it's always counter-productive to making positive changes in the environment. You as an SRE can either make it better, or find an environment that's conducive to positive changes.
If you'd like an outside resource to suggest or read up on better postmortem practices, the Google SRE Book has a chapter [0] on postmortem culture. It's an amazing change of pace and a huge stress level improvement for us SREs.
Where are all of these managers out there running companies and making decisions with literally no thought whatsoever? I've literally never seen them; I almost exclusively work with rational human beings who are able to justify decisions, and the few who aren't haven't been afforded any real power.
Mostly legacy industries. I don't want to indicate it's common. And typically gets weeded out by Director-level.
But I've certainly seen more Top50 companies than not who have at least a few Manager / Sr Manager-level folks, in charge of key teams who own the sole keys to necessary functions, who are happy to say no to anything they're ignorant of, without any impetus to learn about it.
I would mirror the attitude of the person who said no originally.
If they are receptive to feedback and clearly want to do better, I would be kind and explain why I had suggested it not be there in the first place and cite this as an example.
If they were being adamant or denying it was their fault, I'd probably be really quiet and just make subtle remarks about how it would have been better if they listened.
> I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.
This is intriguing. I'm going to remember this but I'm too anal about perfect A+ TLS and renewal is already fully automated these days anyway :-\
I wonder if one could setup their TLS stack to get this effect without the tradeoff...
My apologies for the quoting; we're at the HN nesting limit.
> You could probably get the same effect with a self signed cert. Although that wouldn't get you an A+ on TLS. :)
> Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.
Yeah, even if you could find a way to deny the spammers via esoteric configuration, it'll just make them realize they forgot to turn off TLS validation anyway (which is clearly what they meant to do)
I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.
Read only at first, then write too. It's hard to do writes though because you have to guess at what the person intended.
If someone does 'DROP TABLE ec2.instances', what exactly are they trying to accomplish? Do they want to terminate every ec2.instance? Should we let them?
Questions like that make write access very difficult.
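Purely as a hypothetical sketch of how I think about the problem (not what the tool actually does): anything that maps to a destructive cloud call gets a dry-run plan first and has to be explicitly confirmed before anything is executed.

    DESTRUCTIVE = {"DROP", "DELETE", "TRUNCATE"}

    def plan_write(sql, affected_ids):
        """Dry run: work out what the statement would touch, but change nothing."""
        verb = sql.strip().split()[0].upper()
        return {"sql": sql, "verb": verb,
                "destructive": verb in DESTRUCTIVE,
                "affected": affected_ids}

    def execute_write(plan, apply_fn, confirmed=False):
        """Refuse destructive plans unless the user explicitly confirmed them."""
        if plan["destructive"] and not confirmed:
            raise PermissionError(
                f"{plan['verb']} would affect {len(plan['affected'])} resources; "
                "rerun with confirmed=True")
        apply_fn(plan["affected"])

    # e.g. 'DROP TABLE ec2.instances' -> show the blast radius, then require a yes
    plan = plan_write("DROP TABLE ec2.instances", ["i-0abc123", "i-0def456"])
    print(plan)

Even with a gate like that, the intent question remains: a confirmation prompt only helps if the user understands what the mapping to real resources actually is.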
haha, yeah. very cool, though. would love to chat when you're ready to share if you're looking for feedback.. I've got some features to suggest that would (IMO) increase the value prop.
It happens a lot, when you have so much infrastructure and redundancy you think it is too big to fail. Then you lose S3 in US-East1 and break everything.
While that is solid advice, it still raises the question: is there any organization that corrects itself by firing the incompetent and promoting the competent instead (the "toldyouso" guy in this case)?
Or do we just accept, and make it the norm, that even the lowest level of organizational governance is corrupt?
I am serious about this, because how people perceive their own rights, roles, status, and influence, and their organization's wrongdoing, will in my opinion shape attitudes toward each and every organization in society in the long run.
I know that I was blowing the question out of proportion, but it bugged me to ask anyway.
I particularly like The Gervais Principle as an explanation of why incompetents are frequently the ones making corporate governance decisions. It's satire, but I think there are elements of truth in it.
I view this as growing pains that everyone has to learn the hard way. The bigger question is will they learn the lesson and how will they fix this for the next time?
Ideally, someone should write a postmortem with a timeline of what happened and recommended fixes. These would then be fixed, and nobody would be blamed. (This is called a "blameless postmortem".)
But whether you can get away with that depends on culture.
Honestly it's small compared to everything else. I'd rather leave than do the told-ya-so, though, and put the story in the exit interview or reason for leaving.
I built a bunch of CloudWatch monitoring for an AWS stack, and duplicated critical monitoring using a 3rd party monitoring service as well. So, because the universe hates me, the 3rd party service migrated their hosting into AWS ~18 months later... :sigh:
I haven't been at IBM since 2001, but when I was there any suggestion like this would have been beaten down by multiple layers of the big grey cloud for even intimating that such a visible, key piece of IBM marketing material should be on a third party service.
They would have been told "great, we understand your concerns, please add a card to the tech backlog (or whatever local jargon is at that team) and we'll address it in the next sprint/dev cycle/quarter". That's the right way to acknowledge these complaints while preventing actual progress.
I managed to capture a traceroute on lg.softlayer.com from Seattle to my home near Seattle that went via London (NetworkLayer to London, then Telia back to Seattle). lol
Looks like things are getting better now, though; the looking glass says Seattle and Dallas can get to me without going to crazy destinations. WDC is still icky.
If it was a BGP peer who normally sends you 3 prefixes with under a /20 in aggregate and they suddenly started sending you a whole table, or if you allowed a peer to send you a default route, then both of those are highly avoidable through session configuration and filtering.
If the route which caused the madness came in via a large settlement-free peer (like a big eyeball/access network) or a transit (which is probably giving you the whole table) that's entirely another story.
That wasn't my point though. If you normally take 3 prefixes and suddenly you are receiving 800k+ prefixes then that likely classifies as "unexpected" (and avoidable since you can define max prefixes accepted per session).
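To make the max-prefix point concrete, here's a toy sketch of the guard (illustration only, nothing like a real BGP implementation; the peer name, limit, and prefixes are made up):

    class PeerSession:
        def __init__(self, name, max_prefixes):
            self.name = name
            self.max_prefixes = max_prefixes  # e.g. expect 3 prefixes, cap at 10
            self.received = 0
            self.shutdown = False

        def on_announce(self, prefix):
            if self.shutdown:
                return
            self.received += 1
            if self.received > self.max_prefixes:
                # A real router tears the session down (or at least alarms)
                # instead of accepting a surprise full table from a small peer.
                self.shutdown = True
                print(f"{self.name}: max-prefix {self.max_prefixes} exceeded, session down")

    peer = PeerSession("small-peer", max_prefixes=10)
    for i in range(20):
        peer.on_announce(f"192.0.2.{i}/32")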
I wasn't suggesting you could, I was questioning whether you should. :-)
I'm here from the Sysdig infra team, and wanted to note that the timing was just a coincidence. Our maintenance was a pre-scheduled upgrade to our platform that was communicated in advance, and unrelated to any outage.
A better approach is to have it hosted on a different cloud platform. If you really care, you’ll set it up on a different domain and nameserver as well with a long lived redirect (cached on CDNs) from the usual status.example.com or example.com/status.
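The redirect part is cheap to stand up. A minimal sketch (hostnames are hypothetical): serve a 301 from the usual hostname to the independently hosted one, with a Cache-Control long enough that CDNs keep serving the redirect even while the origin is down:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class StatusRedirect(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(301)
            self.send_header("Location", "https://status-example.net" + self.path)
            # One week; CDNs can keep serving this even if the origin disappears.
            self.send_header("Cache-Control", "public, max-age=604800")
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), StatusRedirect).serve_forever()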
Their status page also seems integrated into their internal support ticketing system. It's not a traditional status page. They wanted to maintain a consistent garbage interface to keep it inline with the rest of their administrative service.
The most infuriating thing about this is the ZERO communication coming out of IBM Cloud. No emails. No updates to twitter. Status page down. Support lines clogged.
At least give me something I can point my customers at to show them this is not due to my incompetence.
It seems all their external network connections are down. I assume people will have to drive to the data centers to fix. I really want to see a post mortem on this outage.
Haha, Amazon had the same problem a few years ago when they had a fire in a datacenter; their status checker page was hosted in the same building and was showing everything as fine, while thousands of websites hosted on AWS were down.
There will have to be a post mortem on this. The convention is to be as transparent as possible as to what went wrong. This helps to let current customers know that you found the problem, and have put plans in place to make sure it doesn't happen again.
The purpose of the signalling here is two fold.
1) If convincing enough (with details), you can keep current customers from moving to a competitor.
2) It also lets new customers see how you actually handle a crisis. If the crisis is managed well enough, you can point to this instance to prove you have the technical know-how to handle their needs.
If they don't tell anything, or aren't transparent, then they can expect a mass exodus of customers.
I wonder if that's a thing that would even cross a typical IBM-ers mind? It might just be me, but I get a very strong smell of "We're IBM! There's nowhere else for you to go!" from them...
I'm talking from experience. Most things do get post mortems, but there's a lot of crap they also don't give us post mortems for "because customer data." It's my number 1 complaint, and I fight with managers about this all the time. We have a ton of hypervisor problems, and a lot of networking issues (generally over private network) and they tend to get very very secretive about it.
I didn't use their hypervisors, but I've had a lot of experience troubleshooting their networks. They've gotten a lot better at proactive monitoring, but we used to occasionally find some private networking paths that were having trouble, and until we narrowed it down, it was hard to find. (I dunno, I guess you can't just ask all the routers if there are any ports with errors, but sure enough, when they found the right port, there was usually a huge error count, or something)
The key thing is that each IP 5-tuple (peerA, peerB, protocol, portA, portB) will always take the same path over their network (most likely a different path for the return packets, when A and B are switched), so in order to properly probe, you need to probe on a lot of port combos, and once you find a broken combo, you need to run MTR on those ports so you can give them the MTR that shows the issue.
Or, if you can, have your internode protocol run on multiple connections and drop connections that are showing issues, and let a different customer file the tickets :)
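Roughly what that probing looks like, as a sketch (the target host and port ranges are made up; pin the source port so each combination is a distinct 5-tuple, then point mtr at whatever fails):

    import socket

    TARGET = "10.0.0.42"            # hypothetical peer on the private network
    SRC_PORTS = range(40000, 40010)
    DST_PORTS = range(9000, 9010)

    bad = []
    for sport in SRC_PORTS:
        for dport in DST_PORTS:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(2)
            try:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("", sport))      # pin the source port
                s.connect((TARGET, dport))
            except ConnectionRefusedError:
                pass                     # an RST came back, so the path itself is fine
            except OSError:
                bad.append((sport, dport))
            finally:
                s.close()

    for sport, dport in bad:
        # then run something like: mtr --tcp --port <dport> <target> (flags vary by mtr build)
        print(f"possible broken path: src={sport} dst={dport}")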
My top-level comment was trying to ask for any evidence whether the downtime was related to a wider network issue (eg. BGP, backbone) or if it was specific to (part of) the IBM cloud.
The link was just a data point, not evidence of anything in particular.
About a month ago their Northern Virginia region was down. All the BGP prefixes associated with it disappeared from the internet (routes withdrawn). This time (I went to check when someone mentioned it) they kept advertising, but all traffic went nowhere once it got into their network. Curious to see if there is an RFO released.
No IBM computer has ever made a mistake or distorted information. They are all, by any practical definition of the words, foolproof and incapable of error.
weather.com makes sense. IBM bought the weather channel a while ago, hosting is likely tied to IBM Cloud at this point (although it looks like it's fronted by Akamai)