We are in dire need of something crowdsourced, or where someone like DataDog or other telemetry systems offer you the ability to share non-sensitive metrics publicly for various cloud or SaaS systems that they publish.
Edit: y’all are amazing with these monitoring tools!
When I got hired at Amazon in 2001 we had a "gonefishin" page that was a static page that would be served in the event of an outage (this was before status pages, but it was kind of the same thing -- public acknowledgement of a major incident). The standard protocol was within minutes of a sev 1 to make a decision to display the GF page once it was confirmed that the whole site was down and then work to fix the issue.
By the time I left in 2006 that was no longer policy, since reporters had set up monitoring for that page to detect outages and report on service availability, so they just let it crash and return 500s or whatever the failure mode was. They optimized for making the job of external agencies reporting on their availability harder instead of easier.
> We are in dire need of something crowdsourced, or where someone like DataDog or other telemetry systems offer you the ability to share non-sensitive metrics publicly for various cloud or SaaS systems that they publish.
I use UptimeRobot to monitor a lot of endpoints that I depend upon but don't really control. I've been burned by first-party status pages way too many times.
I guess one could argue it isn't an outage since it only seems to have affected a subset of users. I got on a Zoom call when this issue started and we had 3 of the 4 participants. Only one couldn't connect due to the issue.
But I do agree they should be able to monitor things better and show some sort of update on their status page as soon as possible.
I feel like this is almost worse. It would be awful if you were the only person who couldn't connect to a high-stakes meeting. At least if it happens to everyone, it's obvious that the problem is on Zoom's end.
If you stake your life’s happiness on pleasing morons (and the morons in this case are those that pretty much don’t immediately assume technical problems out of your control) - you’re pretty much guaranteed a bad time.
A couple of months ago, I finally landed a first-round job interview at a place where I've wanted to work for several years. The interview was conducted over Zoom.
What would have happened if Zoom had worked fine on their end, but I was randomly unable to connect? Perhaps it would have been fine—they would have been understanding, and we would have rescheduled for another day. Perhaps if they hadn't been understanding, I shouldn't have wanted to work for them anyway.
But, I don't know. I wanted to work for them, and I was competing with other candidates who presumably interviewed on different days. Hiring processes are inherently imperfect, and lots of things can be consciously or unconsciously treated as a red flag.
(And yes, lots of other things could have happened on the day of the interview. But I still find this scenario particularly scary to think about.)
> Hiring processes are inherently imperfect, and lots of things can be consciously or unconsciously treated as a red flag.
Exactly, so it’s weird to worry about a Zoom problem in particular. If anything it’s a little better now, since most people are conditioned to think of technical problems as less likely to be the affected person’s fault (that’s why I referred to the alternative as “morons”). Even if you left yourself plenty of time and did everything right and public transit fucked you over, it was never a good look.
Whether you consider it an outage or not seems to be a political / PR thing these days. I used to work on a SaaS that relied on a handful of big customers to make payroll. If their favourite stuff stopped working, hell ensued. On the other hand Atlassian pretended nothing was going on for a while recently, because they could afford to lose 400 customers.
This has made me realize why companies like Pingdom have a business. I've always wondered, in the sense that I couldn't quite understand why you'd pay for someone just to ping things and alert you of outages (this was early in my career).
But over the last 4 years specifically, I not only understand it, I can't imagine not having a service like it.
Disclaimer: I don't work for Pingdom and my current company doesn't use their services. I have used them in the past and they're pretty good, but I'm just using them as an example here.
I'm actually building a product to solve this. If anyone is interested in beta testing, we should be rolling this out in 2~3 weeks. Shoot me an email:
mbesto @ gmail service
Is it so hard? At my company, I set up a status page linked to a pinging service which automatically pings various endpoints every 5 minutes (as well as our 3rd-party dependencies) and automatically flags any problems if a ping does not respond.
The pinging service and status page, at our scale, are free, and our status page is actually useful and automated for 90% of our stuff.
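For anyone curious what the core of such a setup looks like, here is a rough Python sketch, not the commenter's actual tooling. The names `http_probe` and `check_all` and the example URLs are made up for illustration, and the 5-minute schedule is assumed to come from cron or whatever scheduler you already run.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP 4xx/5xx, DNS failures, timeouts and malformed URLs all count as "down".
        return False


def check_all(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, str]:
    """Probe every endpoint once and map its name to 'up' or 'down'."""
    return {name: ("up" if probe(url) else "down") for name, url in endpoints.items()}


if __name__ == "__main__":
    # Hypothetical endpoints: our own API plus a third-party dependency.
    statuses = check_all({
        "our-api": "https://api.example.com/health",
        "payments-vendor": "https://status.example.org/ping",
    })
    for name, status in statuses.items():
        print(f"{name}: {status}")
```

The `probe` argument is injectable so the flagging logic can be tested without touching the network; a real status page would also want retries and some history before flipping a component to "down".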
That works for a website, but doesn’t work for all outage types with a service like Zoom. If the website/API is responding, but you can’t create new calls, for example, the outage wouldn’t be detected.
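To make that distinction concrete, here is a minimal Python sketch of separating "the server answers" from "the product works". Everything in it is hypothetical: `Check`, `passes`, `classify`, and the idea of a scripted "create_meeting" step are illustrations, not anything Zoom or a monitoring vendor is known to expose.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True when the step succeeded


def passes(check: Check) -> bool:
    """Run a check, treating any exception inside it as a failure."""
    try:
        return bool(check.run())
    except Exception:
        return False


def classify(liveness: Check, functional: List[Check]) -> str:
    """Combine a plain liveness ping with end-to-end functional checks."""
    if not passes(liveness):
        return "down"
    failed = [c.name for c in functional if not passes(c)]
    if failed:
        # The site responds, but something users actually do is broken.
        return "degraded: " + ", ".join(failed)
    return "operational"
```

A ping-only monitor is the special case where `functional` is empty, which is why it would report "operational" while nobody can start a call.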
I think the problem might be less on the technical and more on the business-side of things.
Status pages that raise customers' confidence in your service are good from a marketing perspective.
Automatically publishing uptime data without human review might be bad from a marketing perspective, if you don't trust the engineering department to actually deliver or if your service depends on too many external dependencies.
I do this at home using LMNS, very useful to detect service failures as well as latency spikes. I ping my upstream ISP router, Google, Cloudflare, my DNS provider, and several others.
Note that as another commenter said this only tells if the server is up, not if the service is working properly or not. In this case it won't work since the Zoom website is loading fine but the meetings don't work properly.
The SLA would not depend on what the status page reported, but the actual downtime. If the script malfunctioned, you wouldn't need to pay out because it wasn't actual downtime contractually - and if it was actual downtime, I guess it makes it harder to squeeze around but that's only if someone is carefully taking legally admissible evidence from the first minute the status page reports (which screenshots alone don't always meet).
They aren't? Every company I've worked at pays through the nose for one of GTM/Zoom/BlueJeans/Lifesize/Chime. Zoom is definitely the least bad of them in my experience, not that I like it at all. Lack of customers willing to pay doesn't seem to have stopped there from being a bounty of terrible alternatives.
I thought WebRTC solved some of these problems, but it seems like peer discovery would still need to be centralised in some way. Are there any protocols that would solve this? (excluding email or blockchain, both of which would require additional extensions or standards to create a solution with smooth UX)
Unfortunately WebRTC provides only part of the features needed to build a web conferencing app. The main issue is that p2p connections don't work well for more than 5 attendees in the same session. After that it is better to have an MCU or SFU.
Other features like recording are much more reliable if they are implemented server side.
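A quick back-of-the-envelope sketch in Python of why a full p2p mesh gets painful as a session grows. The function names are made up, and the model assumes every participant sends one media stream to every other participant, with no simulcast or other optimizations.

```python
def mesh_streams(n: int) -> dict:
    """Full mesh: every participant connects directly to every other one."""
    return {
        "peer_connections": n * (n - 1) // 2,  # one connection per pair
        "uploads_per_client": n - 1,           # a copy of your stream per peer
        "downloads_per_client": n - 1,
    }


def sfu_streams(n: int) -> dict:
    """SFU: every participant holds one connection to a forwarding server."""
    return {
        "peer_connections": n,        # one per participant, all to the server
        "uploads_per_client": 1,      # the server fans the stream out
        "downloads_per_client": n - 1,
    }


if __name__ == "__main__":
    for n in (3, 6, 10):
        print(n, mesh_streams(n), sfu_streams(n))
```

With 6 attendees the mesh needs 15 peer connections and each client uploads its video 5 times, while the SFU keeps uploads at 1 per client. Upload bandwidth is usually the scarce resource, which fits the rough 5-attendee limit mentioned above.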
At least one Zoom Enterprise splash page is showing the generic nginx error message, which initially made me think it was just our tenant since the other subdomains I tried would load the login screen. But sounds like that is not the case.
I think Google is a better option, especially if you have corp Gmail. I am not a fan of O365/Teams, so I do agree with you there. My grievances are just rooted in UX.