My side project StatusGator monitors status pages (including IBM's ill-fated page) and I'm seeing more than 10% of the nearly 800 services we monitor having an outage right now.
So it appears to affect anyone who depends on IBM Cloud.
I really wonder how people get value out of a meta status page when my experience is that status pages are often incorrect about what the actual status is. Whether they're manually updated, or it's a case of "your 9s are not my 9s", it seems like a compounded broken telephone problem.
Probably it's great to have some very big-picture overview, both in scope ("all" the cloud) and in time ("all" time), and maybe there's even some value in looking at the correlation between these.
Maybe it helps with doing a sanity check before picking a provider. And, I guess, at a basic level it helps with accountability/transparency.
When we started broadcasting status checks we thought it would be a great way to let users know what's happening behind the scenes. Then it turned out a lot of our happenings were self-inflicted, so rather than being less transparent by way of fewer posts, we just tweaked our status outputs: "We are experiencing an event in <SOME VERY HIGH LEVEL SERVICE>" x1042. After we perform our RCA, we may post a blog about it or a short PR blurb.
The last time we went down, I questioned out loud the point of the status page, and the general consensus was that it exists so others can reference our outage.
Thanks and that's a GREAT idea for some detailed analysis. I have been trying to make better use of the 5+ years of status page history stashed in the cupboard.
(You should contact the guy at BackBlaze who does their hard drive stats blog posts, and pay him to do this for you as a side hustle - and release this data analysis quarterly!)
So what are HNers using IBM Cloud for and where do you see that it has an edge over AWS offerings (where an overlap exists, obviously)?
(I figure either you're in devops and too busy putting out fires to read this thread, or you're not and your work is halted because of the incident, so you might have time to read and reply ;)
Price for hardware.
As a base price, their bare-metal gear was significantly cheaper than equivalent-specced AWS gear (if it was even possible to get something like that).
We managed to snag quite a few 'interesting' configurations of things at various times that you just couldn't get at all in AWS. Things like PCI SSDs, very large RAM configs, or High-Frequency low-core count CPUs.
Free international/regional transfer.
We took significant advantage of this to move data around. We'd replicate TBs of data around.
At various times management and dev teams would complain and say that we should move everything to AWS (or whatever cloud provider they'd just met with at a conference).
We consistently showed higher performance and lower cost by significant margins. On cost alone, we were paying a small fraction of what it'd cost on AWS, even after taking into consideration ways to reduce cost on AWS such as scaling, spot instances and reserved-instances.
I really would like to see an AWS memo which maps out common use cases and expected costs (selling points) vs actual use cases and actual costs (pain points).
I don't think there's one good answer for everyone that's going to be right.
I think the biggest issue is that far too many people assume that the AWS Savings stories are universally applicable, and that it's safe to assume AWS is going to be the cheap option.
I'm sure there are folks for whom AWS is the cheap option, but it wasn't at my last job, and it's not for my current one (even though they are using it).
More likely their service offerings are growing so fast that trying to make the pricing structure coherent is like a hyper-aggressive game of whack-a-mole.
We used Softlayer (rebranded to IBM Cloud, and affected by this) at my last job. For the most part, their service pretty much just works; clearly not today. :)
We had a couple thousand bare metal servers, and barely used any of their API stuff.
As with any facility, there were occasional issues with electrical transfer switches, core router failures, fiber cuts, etc. Stuff happens, but we got pretty good communication, and things got resolved in a reasonable amount of time. Service got noticeably worse after IBM, but we were already planning to move to our acquirer's hosting, because that's what happens when you're acquired. Oh, and their load balancers had garbage uptime.
Bandwidth prices used to be pretty reasonable, but they've adopted AWS style obscene pricing. At least they still let you use the private network for free (including to other datacenters).
HN ran on a box at Softlayer until early 2018 or so. This makes me think that the title of this post (which was submitted as "IBM Cloud down as well as their status page which looks to be hosted there") could at some point have been "IBM Cloud down as well as their status page which looks to be hosted there as well as the forum where people post these things which also looks to be hosted there".
I've always imagined it as a big tower shoved under someone's desk. The side panel of the case is off because otherwise it overheats. On the screen there's a single maximized window of DrRacket. A post it note warns you not to quit or reboot the system.
If it was a BGP issue, it's not a problem of where the status page is hosted but instead that you just can't get to it via that name no matter where it's hosted, right?
If by "name" you mean hostname, not really. If you have a domain with multiple nameservers in multiple countries on multiple providers, and your site is similarly globally distributed, at least a few people on the internet will always be able to pull up your site. So at least some of your clients will be able to resolve a domain address from at least some of your nameservers and connect to at least some of your web servers. Geo-IP and Anycast are also really useful here.
edit: It's possible that you could take out an entire TLD and make it impossible to resolve domains on that TLD once all the cached records expire. But that kind of targeted attack would not be possible with a BGP error, unless it was a very specifically crafted BGP error happening over a very long period of time (weeks-months-years depending on the record TTLs).
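As a rough illustration of the "at least some resolution will keep working" point, here's a minimal sketch (assuming dnspython >= 2 is installed; example.com is a stand-in for your own domain) that finds the delegated nameservers and asks each one directly, so you can see which providers are still answering:

    import dns.resolver

    DOMAIN = "example.com"  # stand-in for your own domain

    # Find the domain's delegated nameservers, then query each one directly.
    for ns in dns.resolver.resolve(DOMAIN, "NS"):
        ns_name = str(ns.target)
        try:
            ns_addr = next(str(a) for a in dns.resolver.resolve(ns_name, "A"))
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ns_addr]
            answers = r.resolve(DOMAIN, "A")
            print(f"{ns_name} ({ns_addr}) -> {[str(a) for a in answers]}")
        except Exception as exc:
            print(f"{ns_name}: no answer ({exc})")

If the nameservers live with different providers in different regions, some of these queries keep succeeding even when one provider or one route is gone.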
OK, that makes sense, but you'd still at least have this: if the site is down due to BGP, then so is the status page that's on the same domain.
I guess I'm just calling out the people who are making fun of them for having their status page dependent on the same hardware it's monitoring, when it's not clear that's the case just because they're both down?
I would suppose that if it were on a domain under a different TLD, it would be more reasonable to conclude that.
A status page should be a static site hosted on multiple providers in multiple regions with multiple nameservers. So, Amazon S3 hosted in 2 regions, Azure Storage hosted in 2 different regions, 2 different nameserver providers in 2 different countries using two different backend colo providers. Costs probably <$150/year and that will survive BGP outages, backhaul link outages, hosting provider outages, DNS outages.
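If you go that route, a quick probe across the independently hosted copies keeps you honest about whether each one is actually reachable. A minimal sketch (the mirror URLs are made up for illustration):

    import urllib.request

    # Hypothetical mirrors: two S3 regions, Azure Storage, a second TLD, etc.
    MIRRORS = [
        "https://status-us.example.com/index.html",
        "https://status-eu.example.com/index.html",
        "https://status.example.org/index.html",
    ]

    for url in MIRRORS:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                print(f"{url}: HTTP {resp.status}")
        except OSError as exc:
            print(f"{url}: unreachable ({exc})")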
I'm not going to make fun of them for their status page being down, but it certainly doesn't reflect well on the brand/products.
You guys were one of the best use cases for the SL model, which really hasn't changed in 10+ years. You had very few dependencies on the less-reliable (read: all of them) services inside the SL stack and mostly managed everything on box and in software. In a few POPs you guys were running about 50% of the total SL backbone bandwidth. There were a lot of sad panda hats when you guys started to transition away.
> There were a lot of sad panda hats when you guys started to transition away.
For us as well. It was so nice to have things work one day and the next and the next, although I guess they wouldn't have worked today.
Favorite firefighting moment was when WDC lost half the fiber in ~2014 and we had to move all of our traffic out so that there was capacity. Our guy asked why we had to move, and your guy said something like 'Because if you guys move, we only need one customer to move.' :D
Yeah the move from FreeBSD to Linux wouldn't have been fun for you guys either. And yeah, the WDC POPs were some of the most overbuilt from a bandwidth perspective and that was almost entirely because of you guys. Pretty sure there's a Cisco sales rep enjoying a nice holiday home in Connecticut as a result of the growth you guys did.
I had a high-bandwidth use case that SL filled when I was a kid, but we went through resellers: 10TB/UK2, later 100TB after that was a thing. Then they dropped SL, and every one of SL's products became AWS-priced levels of insane per-GB bandwidth.
The odd thing is, for half the price I could get SL service w/10TB from a reseller, while at list price I only got 1-2TB bandwidth, and sales absolutely would not budge on that. I wonder why.
Last job for me was also a few thousand bare metal servers at SoftLayer. Acquired and moved to that infrastructure instead. Wonder if it's the same acquisition? :-)
FWIW - IBM Cloud today has basically no benefit over AWS, Azure or GCE or even against some of the smaller regional players like AliCloud. The notable exception would be if you need to run a bare metal solution and leverage their free backbone which is a pretty narrow use case these days. The main selling point previously was to stand up your own VMware environment but even that came with a laundry list of caveats (unsupported hardware, limited VLANs, non-flexible IP space) that made it painful to use. Today AWS is vastly more performant, flexible, reliable and has a bunch of useful services you don't get from IBM Cloud.
If we speak specifically about IBM Cloud vs AWS: we use the Natural Language Understanding API in IBM Cloud, and as far as I know the equivalent AWS offering, Comprehend, doesn't provide named entity disambiguation or links to knowledge graphs (IBM links to DBpedia).
Unfortunately, that API changes regularly and often in undocumented ways that cause breakages for customers. It's really a lot of fun to deal with when suddenly a bunch of automation breaks and it turns out an unannounced push fundamentally rewrites foundational API calls.
We have been leasing bare metal servers since the pre-IBM Softlayer days.
Over the past few years we have experienced quite a few network-related outages. Not usually to this extent, more generally a failure of some piece of network gear that takes out either backend or frontend traffic from a particular data center. We seriously priced out a migration to another provider recently, but in the end what held us back was cross-AZ transfer costs on AWS. We found it would raise our operating costs significantly, so the matter was dropped.
We were much happier with the service and support we received prior to the IBM acquisition.
I had originally signed up due to the availability and pricing of bare metal servers and the mixed Windows/Linux server offerings. Their windows server licensing was better than AWS and I didn't want to be on Azure for a variety of reasons.
Currently on them because we have an OpenVPN based infrastructure that is very challenging to migrate.
Lastly the majority of our customers are in the midwest or Texas, and the proximity of their Dallas DC was a huge performance win for us.
Rarely is it just a technical decision; usually money is the reason.
In small and mid-size organizations, the CSP gave better pricing, or they help with your sales, etc.
In large organizations, IBM/Oracle bundle with existing products that are being paid for anyway, account managers have great relationships with decision makers, or the company has already signed big multi-year deals.
This is not just IBM, it applies to GCP/Azure/AWS as well.
I like OpenWhisk which is the basis for their serverless compute offering. Has orchestration/state-machine functionality that makes it superior to GCP Cloud Functions, and uses Docker containers which makes it more flexible than AWS Lambda.
I also really like CouchDB which IBM Cloudant is based on.
Is that enough for me to use IBM Cloud? No, not really.
I suspect nobody really uses it outside of weird outsourced financial modelling/planning tools like TM1 and other apps people stopped wanting to manage themselves.
I work as a consultant with big enterprise companies and I can assure you big enterprise companies are using IBM very heavily. As well as Oracle and HP and other uncool tech companies.
We use Restream.io and SolarWinds Papertrail; both were down today. My guess is they use IBM Cloud itself or some rack space that IBM has rented to other clouds, which is apparently typical at the edge of the major public cloud regions.
All of Broadcastify's audio servers (hosted with Softlayer in their Dallas datacenter) are completely unreachable and down.
I'm going to wait a bit to see if we get a status update, otherwise we'll be spinning up instances on AWS to failover (which will be enormously costly for bandwidth)
So far? Pretty much no news, they're using Slack to communicate a bit. VPN access for everybody is broken or barely working, no internal ticketing as a result. As far as I can tell, private networking is mostly working between servers (at least, for my servers, ~60).
I remember I was at an IBM sponsored hackathon around 2015 where it was a requirement to use Bluemix. Over the course of the weekend the service went down for hours 3 times.
Literally this morning I was wondering whatever happened to it. Like, did it die a quiet death? Oh, it rebranded to IBM Cloud in 2017. Now this news.
I think there's an eponymous law named for this sort of thing.
That's funny. I've had the exact same experience with Bluemix at a Hackathon in the past. It was down for almost the entire weekend, screwing all the teams that didn't pivot early enough.
The timing on that incident was a weird coincidence for our team. We had just rolled out a bunch of production updates, and then 5 minutes later ALL our VMs went down, and I freaked out and tried to diagnose what was happening and recover before I started noticing all the other people going down.
So. Hate to ask this. But what happened to your old posts that are now just "-"'s?
I mean I can guess.
It's just that the communication blackout while the incident was happening was not appreciated. And you stepping forward to let us know what was going on was very appreciated.
- 2020-06-10 02:19 UTC - RESOLVED - The network operations team adjusted routing policies to fix an issue introduced by a 3rd party provider and this resolved the incident
I generally do everything on AWS or GCP, with a little Azure sometimes for personal projects. In what world does IBM beat one of those three in anything? Genuinely curious: how are they able to stay competitive?
Their bare metal cloud offering (SoftLayer acquisition) was actually pretty good whenever I used it about 4 years ago. Wasn’t the most intuitive API or UI but you could get a bare metal server anywhere in the world in a few minutes.
When the wind blows in the right direction. Sometimes your server would get stuck in provisioning for hours and only get "un-stuck" after creating a support ticket. Which, I kid you not, at one of my previous jobs we had automated in our provisioning pipeline. Good times.
Their biggest thing going for them is the 100% free dark fiber private network. You do have to pay for a bigger pipe (100 Mbps is included for each server, with a minimal upcharge for gigabit), but that's pretty much a rounding error.
Honest, slightly cynical question: most probably someone inside the responsible team said at some point that it would be very stupid to host the status page on the same infrastructure being monitored, but they were probably ignored... What should that person do now? Say "toldya!" out loud in the postmortem meeting, or simply shut up and move on, because the reality is that we are hired to do some stupid task and not to think for ourselves?
If such people raised concerns and they had been overridden the way you describe, they would sadly have left long ago.
It is not that companies become consciously malicious or are incompetent to start with; it becomes a vicious cycle: as more and more poor management and engineering talent join, the good ones leave, and the cycle continues.
Acquisitions and mergers stave off the slow slide into irrelevance for a while, till the best of the new guys leave too. Systemic cultural change is very, very hard to achieve in large organizations.
My professional experience tells me that the next question will be who decided for B given Y, then you answer it, and then you have a target on your SRE back, I'm afraid. Remember that trickle-down economics works only when the shit hits the fan, and what trickles down is not money.
I've worked in these types of organizations before, and it's always counter-productive to making positive changes in the environment. You as an SRE can either make it better, or find an environment that's conducive to positive changes.
If you'd like an outside resource to suggest or read up on better postmortem practices, the Google SRE Book has a chapter [0] on postmortem culture. It's an amazing change of pace and a huge stress level improvement for us SREs.
Where are all of these managers out there running companies and making decisions with literally no thought whatsoever? I've literally never seen them; I almost exclusively work with rational human beings who are able to justify decisions, and the few who aren't haven't been afforded any real power.
Mostly legacy industries. I don't want to indicate it's common. And typically gets weeded out by Director-level.
But I've certainly seen more Top50 companies than not who have at least a few Manager / Sr Manager-level folks, in charge of key teams who own the sole keys to necessary functions, who are happy to say no to anything they're ignorant of, without any impetus to learn about it.
I would mirror the attitude of the person who said no originally.
If they are receptive to feedback and clearly want to do better, I would be kind and explain why I had suggested it not be there in the first place and cite this as an example.
If they were being adamant or denying it was their fault, I'd probably be really quiet and just make subtle remarks about how it would have been better if they listened.
> I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.
This is intriguing. I'm going to remember this but I'm too anal about perfect A+ TLS and renewal is already fully automated these days anyway :-\
I wonder if one could setup their TLS stack to get this effect without the tradeoff...
My apologies for the quoting; we're at the HN nesting limit.
> You could probably get the same effect with a self signed cert. Although that wouldn't get you an A+ on TLS. :)
> Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.
Yeah, even if you could find a way to deny the spammers via esoteric configuration, it'll just make them realize they forgot to turn off TLS validation anyway (which is clearly what they meant to do)
I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.
Read only at first, then write too. It's hard to do writes though because you have to guess at what the person intended.
If someone does 'DROP TABLE ec2.instances', what exactly are they trying to accomplish? Do they want to terminate every ec2.instance? Should we let them?
Questions like that make write access very difficult.
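Purely as a hypothetical sketch of how I think about the problem (not what the tool actually does): anything that maps to a destructive cloud call gets a dry-run plan first and has to be explicitly confirmed before anything is executed.

    DESTRUCTIVE = {"DROP", "DELETE", "TRUNCATE"}

    def plan_write(sql, affected_ids):
        """Dry run: work out what the statement would touch, but change nothing."""
        verb = sql.strip().split()[0].upper()
        return {"sql": sql, "verb": verb,
                "destructive": verb in DESTRUCTIVE,
                "affected": affected_ids}

    def execute_write(plan, apply_fn, confirmed=False):
        """Refuse destructive plans unless the user explicitly confirmed them."""
        if plan["destructive"] and not confirmed:
            raise PermissionError(
                f"{plan['verb']} would affect {len(plan['affected'])} resources; "
                "rerun with confirmed=True")
        apply_fn(plan["affected"])

    # e.g. 'DROP TABLE ec2.instances' -> show the blast radius, then require a yes
    plan = plan_write("DROP TABLE ec2.instances", ["i-0abc123", "i-0def456"])
    print(plan)

Even with a gate like that, the intent question remains: a confirmation prompt only helps if the user understands what the mapping to real resources actually is.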
haha, yeah. very cool, though. would love to chat when you're ready to share if you're looking for feedback.. I've got some features to suggest that would (IMO) increase the value prop.
It happens a lot, when you have so much infrastructure and redundancy you think it is too big to fail. Then you lose S3 in US-East1 and break everything.
While that is solid advice, it still raises the question: is there any organization that corrects itself by firing the incompetent and promoting the competent instead (the "toldyouso" guy in this case)?
Or do we just accept, and make it the norm, that even the lowest level of organizational governance is corrupt?
I am serious about this, because how people perceive their own rights, roles, status, and influence, and their organization's wrongdoing, will in my opinion shape attitudes toward each and every organization in society in the long run.
I know that I was blowing the question out of proportion, but it bugged me to ask anyway.
I particularly like The Gervais Principle as an explanation of why incompetents are frequently the ones making corporate governance decisions. It's satire, but I think there are elements of truth in it.
I view this as growing pains that everyone has to learn the hard way. The bigger question is will they learn the lesson and how will they fix this for the next time?
Ideally, someone should write a postmortem with a timeline of what happened and recommended fixes. These would then be fixed, and nobody would be blamed. (This is called a "blameless postmortem".)
But whether you can get away with that depends on culture.
Honestly it's small compared to everything else. I'd rather leave than do the told-ya-so, though, and put the story in the exit interview or reason for leaving.
I built a bunch of CloudWatch monitoring for an AWS stack, and duplicated critical monitoring using a 3rd party monitoring service as well. So, because the universe hates me, the 3rd party service migrated their hosting into AWS ~18 months later... :sigh:
I haven't been at IBM since 2001, but when I was there any suggestion like this would have been beaten down by multiple layers of the big grey cloud for even intimating that such a visible, key piece of IBM marketing material should be on a third party service.
They would have been told "great, we understand your concerns, please add a card to the tech backlog (or whatever local jargon is at that team) and we'll address it in the next sprint/dev cycle/quarter". That's the right way to acknowledge these complaints while preventing actual progress.
I managed to capture a traceroute on lg.softlayer.com from Seattle to my home near Seattle that went via London (NetworkLayer to London, then Telia back to Seattle). lol
Looks like things are getting better now, though; the looking glass says Seattle and Dallas can get to me without going to crazy destinations. WDC is still icky.
If it was a BGP peer who normally sends you 3 prefixes with under a /20 in aggregate and they suddenly started sending you a whole table, or if you allowed a peer to send you a default route, then both of those are highly avoidable through session configuration and filtering.
If the route which caused the madness came in via a large settlement-free peer (like a big eyeball/access network) or a transit (which is probably giving you the whole table) that's entirely another story.
That wasn't my point though. If you normally take 3 prefixes and suddenly you are receiving 800k+ prefixes then that likely classifies as "unexpected" (and avoidable since you can define max prefixes accepted per session).
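To make the max-prefix point concrete, here's a toy sketch of the guard (illustration only, nothing like a real BGP implementation; the peer name, limit, and prefixes are made up):

    class PeerSession:
        def __init__(self, name, max_prefixes):
            self.name = name
            self.max_prefixes = max_prefixes  # e.g. expect 3 prefixes, cap at 10
            self.received = 0
            self.shutdown = False

        def on_announce(self, prefix):
            if self.shutdown:
                return
            self.received += 1
            if self.received > self.max_prefixes:
                # A real router tears the session down (or at least alarms)
                # instead of accepting a surprise full table from a small peer.
                self.shutdown = True
                print(f"{self.name}: max-prefix {self.max_prefixes} exceeded, session down")

    peer = PeerSession("small-peer", max_prefixes=10)
    for i in range(20):
        peer.on_announce(f"192.0.2.{i}/32")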
I wasn't suggesting you could, I was questioning whether you should. :-)
I'm here from the Sysdig infra team, and wanted to note that the timing was just a coincidence. Our maintenance was a pre-scheduled upgrade to our platform that was communicated in advance, and unrelated to any outage.
A better approach is to have it hosted on a different cloud platform. If you really care, you’ll set it up on a different domain and nameserver as well with a long lived redirect (cached on CDNs) from the usual status.example.com or example.com/status.
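The redirect part is cheap to stand up. A minimal sketch (hostnames are hypothetical): serve a 301 from the usual hostname to the independently hosted one, with a Cache-Control long enough that CDNs keep serving the redirect even while the origin is down:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class StatusRedirect(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(301)
            self.send_header("Location", "https://status-example.net" + self.path)
            # One week; CDNs can keep serving this even if the origin disappears.
            self.send_header("Cache-Control", "public, max-age=604800")
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), StatusRedirect).serve_forever()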
Their status page also seems integrated into their internal support ticketing system. It's not a traditional status page. They wanted to maintain a consistent garbage interface to keep it inline with the rest of their administrative service.
The most infuriating thing about this is the ZERO communication coming out of IBM Cloud. No emails. No updates to twitter. Status page down. Support lines clogged.
At least give me something I can point my customers at to show them this is not due to my incompetence.
It seems all their external network connections are down. I assume people will have to drive to the data centers to fix. I really want to see a post mortem on this outage.
Haha, Amazon had the same problem a few years ago when they had a fire in a datacenter; their status checker page was hosted in the same building and was showing everything as fine, while thousands of websites hosted on AWS were down.
There will have to be a post mortem on this. The convention is to be as transparent as possible as to what went wrong. This helps to let current customers know that you found the problem, and have put plans in place to make sure it doesn't happen again.
The purpose of the signalling here is two fold.
1) If convincing enough (with details), you can keep current customers from moving to a competitor.
2) It also lets new customers see how you actually handle a crisis. If the crisis is managed well enough, you can point to this instance to prove you have the technical know-how to handle their needs.
If they don't tell anything, or aren't transparent, then they can expect a mass exodus of customers.
I wonder if that's a thing that would even cross a typical IBM-ers mind? It might just be me, but I get a very strong smell of "We're IBM! There's nowhere else for you to go!" from them...
I'm talking from experience. Most things do get post mortems, but there's a lot of crap they also don't give us post mortems for "because customer data." It's my number 1 complaint, and I fight with managers about this all the time. We have a ton of hypervisor problems, and a lot of networking issues (generally over private network) and they tend to get very very secretive about it.
I didn't use their hypervisors, but I've had a lot of experience troubleshooting their networks. They've gotten a lot better at proactive monitoring, but we used to occasionally find some private networking paths that were having trouble, and until we narrowed it down, it was hard to find. (I dunno, I guess you can't just ask all the routers if there are any ports with errors, but sure enough, when they found the right port, there was usually a huge error count, or something)
The key thing is that each IP 5-tuple (peerA, peerB, protocol, portA, portB) will always take the same path over their network (most likely a different path for the return packets, when A and B are switched), so in order to properly probe, you need to probe on a lot of port combos, and once you find a broken combo, you need to run MTR on those ports so you can give them the MTR that shows the issue.
Or, if you can, have your internode protocol run on multiple connections and drop connections that are showing issues, and let a different customer file the tickets :)
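Roughly what that probing looks like, as a sketch (the target host and port ranges are made up; pin the source port so each combination is a distinct 5-tuple, then point mtr at whatever fails):

    import socket

    TARGET = "10.0.0.42"            # hypothetical peer on the private network
    SRC_PORTS = range(40000, 40010)
    DST_PORTS = range(9000, 9010)

    bad = []
    for sport in SRC_PORTS:
        for dport in DST_PORTS:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(2)
            try:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("", sport))      # pin the source port
                s.connect((TARGET, dport))
            except ConnectionRefusedError:
                pass                     # an RST came back, so the path itself is fine
            except OSError:
                bad.append((sport, dport))
            finally:
                s.close()

    for sport, dport in bad:
        # then run something like: mtr --tcp --port <dport> <target> (flags vary by mtr build)
        print(f"possible broken path: src={sport} dst={dport}")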
My top-level comment was trying to ask for any evidence whether the downtime was related to a wider network issue (eg. BGP, backbone) or if it was specific to (part of) the IBM cloud.
The link was just a data point, not evidence of anything in particular.
About a month ago their Northern Virginia region was down. All the BGP prefixes associated with it disappeared from the internet (routes withdrawn). This time (I went to check when someone mentioned it) they kept advertising, but all traffic went nowhere once it got into their network. Curious to see if there is an RFO released.
No IBM computer has ever made a mistake or distorted information. They are all, by any practical definition of the words, foolproof and incapable of error.
weather.com makes sense. IBM bought the weather channel a while ago, hosting is likely tied to IBM Cloud at this point (although it looks like it's fronted by Akamai)