Just discovered this problem on my company website, thanks for the update.
--
Seems to be getting better now - intermittent 502s and 504s, with the occasional successful load (speed is choppy, though)
--
Back to near-instant load times, great job guys! I look forward to the blog entry on this. (P.S. It might be worth hosting your status page elsewhere in the future. While this might not happen very often, it's exactly when it does that your status page needs to be working; the lack of redundancy here is startling.)
I'm starting to think it's questionable to depend on CloudFlare more than necessary, but they're still the best option for some things. (I'm a customer, but probably going to stop being a customer this week; I was mostly curious to test it out. Not really decided, though.)
1) The CloudFlare security model for SSL basically lets them MITM all your traffic. Probably not a big deal for SSLizing a normal website, or even for accepting credit cards, since they're a decent-sized US company with legal liability, although I'd be concerned about their internal security vs. your own internal security (since you're still fully exposed on your side, too -- it doesn't improve security, and can at best not be a source of new vulnerability).
2) Their DNS doesn't appear particularly redundant; it's just anycast in one big block. Using CloudFlare for DNS seems to be bad practice; you should use something else and cname to CF. Ideally something with multiple DNS servers either individually anycast or in at least two independent (probably anycast) netblocks.
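For what it's worth, here's a rough sketch of the kind of check being described: pull a zone's NS records and see whether all of the nameserver IPs land in the same /24, which would hint at a single (anycast) netblock rather than independent ones. This assumes the third-party dnspython package, "example.com" is just a placeholder, and the /24 grouping is deliberately crude.

    # Rough sketch: do all of a zone's nameservers sit in one /24?
    # Assumes the third-party dnspython package (pip install dnspython);
    # "example.com" is a placeholder zone.
    import dns.resolver

    def nameserver_prefixes(zone):
        prefixes = set()
        for ns in dns.resolver.resolve(zone, "NS"):
            for a in dns.resolver.resolve(ns.target.to_text(), "A"):
                ip = a.to_text()
                prefixes.add(".".join(ip.split(".")[:3]))  # crude /24 grouping
        return prefixes

    prefixes = nameserver_prefixes("example.com")
    print(prefixes)
    if len(prefixes) < 2:
        print("All nameservers appear to sit in a single /24 -- no netblock diversity.")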
3) Performance of the proxy service seems adequate in my experience, but for sites with large amounts of overseas-source traffic, I've heard of people getting lots of suspected-bad-guy path. For a free forum like 4chan that's probably fine; for an e-commerce site, probably not.
Can you say more about "lots of suspected-bad-guy path"? (I do not use CloudFlare currently and am not intending to do so, but I do run an e-commerce site, and have not heard this term used before: the idea of an entire axis along which I have not been evaluating my infrastructure intrigues me.)
As an ordinary web user, if CloudFlare suspect you may be malicious (your IP address being on a spam blacklist seems to be enough) they will show you a warning that you should scan your PC for viruses and ask you to complete a captcha.
(They linked to a report on a spam blacklist claiming my IP address had been used to send spam during the past week. I'd had my IP address for about twelve hours. Thanks, TalkTalk!)
As a web site operator, you have the option of switching this behaviour off.
I saw their 'bad guy' page many times living in Costa Rica then Nicaragua. I didn't realize what it was until a friend started testing out CF and I got it on his site and then we realized what was going on.
That was over a year ago though, I assume things have improved as they got more and better data (and resources). I was in Costa Rica in January/February then again in August/September last year and didn't come across it I don't think.
I've been in Turkey for the last couple of weeks and haven't seen one at all, and that's including using an EC2 proxy because of the censorship here.
If they think you (as an end user accessing a CF site) are a source of DDoS or other abusive traffic, they push your traffic into a sandbox. You then have to answer a captcha before moving on; I assume the bounce rate is incredibly high among most users at that point.
My feels go out to the ops folk at Cloudflare. Mistakes happen no matter how many years of experience people bring in, or how much they're paid. We're all human, after all. It must be a high-pressure job, being responsible for potentially millions of dollars of losses during this downtime.
I hope the issue is resolved soon and if a person caused it, they're not in too much trouble.
I would hope that they would treat this problem as an opportunity to improve their systems instead of haranguing an employee. All systems fail, the key to success is to turn that failure around and make sure it doesn't happen (the same way) again.
If this was a human error, then change the process to make that error difficult to repeat. If it was a software problem, then they have just found a new set of automated tests to write. If it was a vendor problem, contact the vendor to see how they plan to prevent this problem in the future. If it was a design problem, change the structure of the system to make this type of problem less likely.
Quite a few big names have had similar outages in the past but the ones that I forgive (Google, Amazon) are the ones that talk openly about the issue and outline the changes that they are making to fix them. I'm looking forward to CloudFlare's explanation.
BTW: My boss asked me this week for more information on Railgun and if we should change CDN providers. Railgun sounds incredible and I think it could really help our platform. How CloudFlare responds to this incident is critical to my decision to move forward with Railgun testing or to just forget it completely.
Something I always wondered - do you use Cloudflare simply as a CDN for photos, or does 4chan also frequently become the target of DDoS and other external attacks?
We use CloudFlare for everything -- CDN, DDoS mitigation, Railgun (see below), et cetera. They serve ~1.5 petabytes of bandwidth on our behalf every month and proxy billions upon billions of requests. I'm a huge fan of the product and team, despite hiccups like this one.
Edit -- I am officially faster than carrier pigeon:
me: was i the first human to notify you guys?
me: i caught it within the first 30-60 seconds i think
me: because i have no life and never sleep
CloudFlare Ops pal: yeah. you did.
If anyone wants to hire me to check their site instead of Pingdom, feel free to ping me!
It's expected that eventually things will go down; I've never once found a piece of software, server, person, or traffic light that has performed at 100% for its entire lifetime. The best thing we can take from this, however, is the response time.
Barely an hour from outage to being live again - I barely had time to fire an email to my colleagues and update our Twitter before it came back online. What's worrying here, though, is that there was no fallback in place whatsoever. Hopefully we will be hearing from the Cloudflare team soon to let us know what went wrong, but if they are half the company they purport to be, they will learn a lot from whatever went wrong today.
Human error is unacceptable. Online services should never have a single point of failure which can take their entire global presence offline -- not only because avoiding one minimizes the risk of catastrophic accidents, but also because hardware does occasionally break.
This is why every piece of kit should be mirrored with a redundant backup, and why many businesses even have entire duplicated standby systems for such disasters. Even if that gear cannot support the entire infrastructure, it's usually enough to at least publish an official status page. Having to use Twitter to update users is just amateurish in my opinion.
No systems are immune to failure. No matter how much redundancy you have, chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.
It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.
But the number of possible failure scenarios we don't understand properly is pretty much infinite.
Losing DNS was fairly unforgivable. DNS as a protocol is designed to make it easy to deal with server and network outages (to the point of losing netblocks from the global routing tables). They added anycast DNS, which is great, but didn't split their DNS into multiple anycast netblocks.
I run a number of DNS servers myself. And yes, it was "fairly unforgivable" to mess that part up.
But what I was addressing was the blanket claim that human error is unacceptable. Anyone who runs a setup much larger than a calculator will deal with human errors - whether actual operational errors or human inability to engineer for resilience against all possible but unlikely scenarios - on a regular basis.
Some things should be harder than others to break, and DNS is amongst them. Cloudflare no doubt have plenty of lessons to learn. But so does everyone else.
> I run a number of DNS servers myself. And yes, it was "fairly unforgivable" to mess that part up.
So then you basically agree with my point <_<
> But what I was addressing was the blanket claim that human error is unacceptable. Anyone who runs a setup much larger than a calculator will deal with human errors - whether actual operational errors or human inability to engineer for resilience against all possible but unlikely scenarios - on a regular basis.
You're twisting my words and taking them out of context. I was saying that human error is an unacceptable excuse for the entire stack of a company the size of Cloudflare going offline. And I was saying that because redundancy systems should act as a "safety net" so that administrators can make human errors. I've lost count of the number of dumb mistakes I've made over the years, but each time I've been able to switch to a backup system while I worked towards undoing my cock-up. And you said yourself that a complete DNS outage was unacceptable, so clearly you and I are more or less on the same page regarding this.
This response indicates lack of responsibility to me.
Human failure is unacceptable when we know it happens, and we should do everything possible to guard against it. You say "guess at some we don't". Well, we do know all about human failure.
All that sympathy for that bloke who "fired himself", because everyone here agreed that his potential human error should have been anticipated, but not so in this case? Seems to me we are applying different standards.
Most of all, what I don't like is universal get-outs. It reminds me of the worst lie of all: "Sorry Sir, it's a computer error".
> You say "guess at some we don't". Well, we do know all about human failure.
You miss the point. We can continue to merely enumerate possible error scenarios until the heat death of the universe, and we will still miss some.
It is "human error" for an operations team to not put in place methods for ensuring their systems stay up within agreed parameters.
But the reality is that it is not even theoretically possible for us to engineer a system which can guarantee no downtime. Furthermore, no organization is willing to pay the bill to address more than a relatively small fraction of the problems we can easily predict, because many even relatively likely failures are more expensive to protect against than they are worth.
So to begin with, we can't prevent failure. And even if we could, what from the outside looks like human error is often internally a result of either intentional budgetary constraints, or unintended consequences of lack of resources.
It is not about lack of responsibility. It is about dispelling the fantasy that there is someone who is guilty of not doing their job correctly behind every failure.
That is not to say that there might not have been unacceptable human errors in this specific case. But that is entirely beside the point.
> but not so in this case?
I thought it was pretty clear that my comment applied to the general statement that "human error is unacceptable", but perhaps not. I explicitly wrote "It's very well possible that Cloudflare messed up here" because I didn't want to speculate on the specific problems in this case before the causes were even known.
> Most of all, what I dont like is universal get outs.
They are not "universal get-outs". Nobody is going to say it is acceptable if the error is caused by someone bringing coffee into the ops room and spilling it all over the single server, for example. But there is a vast range between someone who is grossly negligent and/or incompetent and who should bear the blame, and someone who is doing their job as well as is reasonable given the resources available to them, but who still makes mistakes or oversights, or simply doesn't have the time or resources to address some reasonably unlikely issue that eventually happens to cause downtime.
> This response indicates lack of experience to me.
I could say the same in return. I've worked in a few data centres and they've all put redundancy at the forefront of their design. (even to the point of having multiple physical fibre paths for fail over).
> No systems are immune to failure. No matter how much redundancy you have
You have that backwards. No systems are immune to failure, which is why you have redundancy.
> chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.
If the dependencies haven't been anticipated then someone isn't doing their job right. There's a reason why incident response / disaster recovery / business continuity plans are written. It should be someone's job to think up every "what if" scenario, ranging from each and every bit of kit dying, to all your staff winning the lottery and walking out the next day, to even terrorist attacks. I've even had to account for what would happen if nukes were dropped on the city where our main data center was housed (though the answer to that was a simple one: nobody would care that our site went offline). It might sound clichéd, but people get paid to expect the unexpected and work out how to maintain business continuity.
> It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.
This was their infrastructure failing. If you own and maintain the infrastructure then you have no excuse not to work out what might happen if each and every part of that infrastructure failed. (trust me, I have had to do this in my last two jobs - despite your accusations of my "lack of experience" ;) ).
> But the number of possible failure scenarios we don't understand properly is pretty much infinite.
You're confusing cause and effect. The number of different causes for failure is infinite. But the effect is finite. For example: a server could crash for any number of reasons (hardware, software, user error, and all the different ways within those categories), but the end result is the same: the server has crashed. So what you do is plan for situations where different services fail (staff don't turn up for work, your domain name service stops responding, etc.) and build some kind of redundancy around that, giving engineers a little more breathing time to fix the issue with the minimum possible disruption to your users. As Cloudflare had to resort to Twitter to update their users, they completely failed every possible aspect of such planning. And given the high-profile sites that depend on Cloudflare, they have no excuses.
If this happened in any of the other companies I worked for, I'd genuinely be fearful for my job, as a crash of that magnitude would mean that I hadn't done my job properly.
As someone who hosts hundreds of PAID sites with CloudFlare this is pretty unacceptable. I'm giving them thousands of dollars so that this doesn't happen. Will probably be moving off unless they have some very good reasoning behind a world-wide shutdown of a geo-redundant service...
Fair enough. But then what is the point of paying for enterprise support if all you get is the same generic response a normal punter gets? If they can't offer better, then enterprise support seems poor value at best.
Are we sure that if CloudFlare wasn't so popular here, the word scam wouldn't be used?
So you're saying you want them to say the same thing but reword it since you're enterprise? If they don't know the root cause, they don't know the root cause.
My startup NameTerrific can support instantaneous DNS updates in a geo-redundant Anycast infrastructure. As long as your TTL is sufficiently low (<300), the impact is quite limited as propagation time is negligible at NameTerrific.
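(Not specific to any one provider, but a quick way to see what TTL resolvers are actually being handed for a record, which bounds how long a stale answer can linger after a repoint; this assumes the third-party dnspython package and a placeholder hostname:)

    # Check the TTL currently served for a record.
    # Assumes dnspython; www.example.com is a placeholder.
    import dns.resolver

    answer = dns.resolver.resolve("www.example.com", "A")
    print("TTL:", answer.rrset.ttl, "seconds")
    # With a TTL under 300, a repointed record should be picked up by most
    # resolvers within about five minutes of the change.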
EDIT: Sorry guys. We got some issues with a gem after installing the recently updated ruby2.0.0p0. The unicorn workers were timing out. TerrificDNS is completely unaffected and the site is already running again.
Well, we have already soft launched our own TerrificDNS Anycast and it has replaced the Route 53 solution. TerrificDNS platform is running on Redis + PowerDNS.
Would love some details on how to do this, I've been a (business) CF customer for a while now and have never seen this service and I haven't turned up anything googling/scouring their website for it just now.
Yep, this is definitely something I'll do in the future, and I guess it was my fault for trusting a single service... even one that is designed around keeping your site up and redundant.
In my experience, cloudflare has been little more than a scam for anyone with half decent traffic.
Not really surprised. Funny how the status page shows all green (are those just static buttons?) while they acknowledge there is an issue and that they don't know what's going on.
"2500% guarantee
This extended Service Level Agreement guarantees 100% uptime, and adds a multiplier to owed service credits resulting from any lapse: 5 times any downtime minutes and 5 times customers affected = 2500% guarantee."
I wish there were a business model for a third-party site-monitor and site-uptime service, which lets the site owner do more than just post updates, but also prevents the site owner from lying about historical data.
Basically New Relic (that actually worked) + Pingdom + Internet Archive + Twitter + status.example.com.
Doesn't really matter -- as long as outages at the two sites are independent, and relatively unlikely, they usually won't happen at the same time. You can make the status service pretty reliable, cheaply, compared to most other services, and if it loses its central servers, the remote monitoring nodes can still store test results, so when the service comes back up, the historical record should be accurate.
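A minimal sketch of that buffering idea, under the assumption of a simple probe-plus-collector design (all URLs and the file name below are placeholders; standard library only): each remote probe appends results locally and only ships them upstream when the central collector answers, so the history survives a central outage.

    # Sketch of a remote probe that keeps its own log and uploads it when the
    # central collector is reachable; URLs and file name are placeholders.
    import json, time, urllib.request, urllib.error

    TARGET = "https://example.com/"                  # site being monitored
    COLLECTOR = "https://collector.example/ingest"   # central service
    BUFFER_FILE = "probe_buffer.jsonl"

    def check():
        start = time.time()
        try:
            status = urllib.request.urlopen(TARGET, timeout=10).status
        except Exception:
            status = None                            # unreachable or hard failure
        return {"ts": start, "status": status, "latency": time.time() - start}

    def flush_buffer():
        try:
            with open(BUFFER_FILE) as f:
                payload = f.read().encode()
            req = urllib.request.Request(COLLECTOR, data=payload,
                                         headers={"Content-Type": "application/x-ndjson"})
            urllib.request.urlopen(req, timeout=10)
            open(BUFFER_FILE, "w").close()           # clear only after a successful upload
        except (OSError, urllib.error.URLError):
            pass                                     # keep buffered results for next round

    while True:
        with open(BUFFER_FILE, "a") as f:
            f.write(json.dumps(check()) + "\n")
        flush_buffer()
        time.sleep(60)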
I use a pretty major forum that has a huge amount of traffic. The owner migrated it to CloudFlare. For the past 5-6 weeks, 50% of the site's requests have gone to a 'Sorry, xyz is not available right now' page.
You should make sure that the site is not actually returning 500s for those requests. We had some similar issues when we first started using the service.
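One crude way to check that (the URL below is a placeholder; standard library only) is to hit the site repeatedly and tally the status codes, so you can see whether the "not available" page lines up with genuine 5xx responses:

    # Tally status codes over a batch of requests; the URL is a placeholder.
    import collections
    import urllib.request, urllib.error

    URL = "https://forum.example.com/"
    counts = collections.Counter()

    for _ in range(20):
        try:
            counts[urllib.request.urlopen(URL, timeout=10).status] += 1
        except urllib.error.HTTPError as e:
            counts[e.code] += 1
        except urllib.error.URLError:
            counts["network error"] += 1

    print(counts)  # e.g. Counter({200: 11, 502: 9}) would match the ~50% figure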
Somebody pointed out the CNAME setup available on CloudFlare. I never knew that, so I checked out the article. The first paragraph:
"CNAME setup is a manual process generally available to paid CloudFlare plans only. If you are interested in testing CNAME setup, please contact CloudFlare first with the domain you would like to test CNAME with. Please specifically mention CNAME Setup in the subject field for faster review. Allowing for CNAME setup is entirely at the discretion of CloudFlare."
So NO: this isn't even a feature at all. They made it as hard as possible to set this up and will grant you the use of it as they like.
It seems that CloudFlare's DNS is down, and affecting NameTerrific as we have a CNAME record pointing to them. I had to change the CNAME record to get our site working again.
EDIT: Based on Twitter search, all CloudFlare sites seem to be down.
That part surprised me. I thought it was common for ISPs to not use their own services for critical web presence, i.e. places users might visit when it's down.
Even their status page is down. And sure - all my sites too. And the funny part is that for most of my sites I have turned off the CloudFlare features and use just their DNS. I never thought I would go down because of DNS not being available.
btw - sites are back. I hope they will not go down again.
Why have I switched off their services? Because once I enabled them, sites got slower, not faster. Sure, I doubt that most users noticed, but if I checked Pingdom or Google Crawl Stats, the 'Time spent downloading a page' situation was very clear: with CloudFlare it took Google 2x more time to download a page than without. I had no time to investigate why that's so, but after switching CloudFlare off I was back to 500ms again.
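If anyone wants to reproduce that kind of comparison themselves, here's a crude sketch: time full-page downloads through the proxied hostname and directly against the origin IP with a Host header. The hostname and IP are placeholders, and plain HTTP is used to keep the example simple.

    # Crude timing comparison: proxied hostname vs. direct-to-origin.
    # www.example.com and 203.0.113.10 are placeholders; plain HTTP for simplicity.
    import time
    import urllib.request

    def fetch_ms(url, host_header=None):
        req = urllib.request.Request(url)
        if host_header:
            req.add_header("Host", host_header)
        start = time.monotonic()
        urllib.request.urlopen(req, timeout=30).read()
        return (time.monotonic() - start) * 1000

    proxied = [fetch_ms("http://www.example.com/") for _ in range(5)]
    direct = [fetch_ms("http://203.0.113.10/", host_header="www.example.com")
              for _ in range(5)]

    print("via proxy (ms):       ", [round(t) for t in proxied])
    print("direct to origin (ms):", [round(t) for t in direct])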
Edit: Sites are not back. I guess the ones that work are the ones for which I have DNS cached. Let's wait...
When I briefly did that test, I didn't find much difference on Pingdom; but I like some of their other services and I assume they can handle burst activity at least as well as my own server. So I'm a big fan overall and keep using them.
Akamai isn't as accessible as Cloudflare. For one example I cannot even see how much they charge for similar services, I'd likely have to get a "quote" which means that the whole thing is out of budget for most small-medium businesses.
Akamai's whole web presence is clearly aimed at enterprise-class customers. I mean, there are whitepapers everywhere, the site is filled with business nonsense, and I cannot even get started without at least a phone call or five.
Maybe people "jump on" to new companies because those companies actually offer a product old companies are not? I can go to Cloudflare right now and sign up, I can give them my CC info and pay, and everything is up-front and easy. I can be online with Cloudflare within the hour (or whenever DNS moves).
I have been using CDNs since I was a very small business (CacheFly, EdgeCast, and now CDNetworks), and if you are seriously using CloudFlare because you haven't even tried setting up a phone call with Akamai due to some kind of mental aversion, you are doing your business a disservice. (I mean, seriously: is the 20 minutes you might waste really that valuable? What's the harm... it might even go well!)
Service sales 101 - if a business model contains manually/individually quoting every potential customer, then your prices cannot possibly be reasonably low, as your sales model requires enterprise-level prices to pay for your salesmen.
It's not that I can't spend 20 minutes on a call - but if your customer acquisition costs involve at least 20 minutes of salesman-time per possible lead (so, at say one sign-up for every ten leads, at least 200 minutes of salesman-time for even the smallest sign-up), then I'm very sure that whatever number they quote will not be something that I can or want to afford. Let them pester some 'enterprises' with that.
Again, I have been using CDNs for a while now, including one that is much larger than CloudFlare (CDNetworks), and on various occasions I have talked with Akamai to get quotes. In my book, CloudFlare reams medium-sized companies on price, and they can do that precisely because people such as yourself have miscalibrated your "I'm getting screwed" detector. (The same is true of Amazon CloudFront, btw.) In essence, once you realize people have this algorithm (specifically, where they don't even bother calling a sales person) it is trivial to exploit you by quoting a large price that you will never do comparison on (and then, rather than compete on price, you end up competing on a ton of questionable value-adds, and do price discrimination on things like 24/7 support).
Well, would I even notice if I were getting any invoices worth haggling over? If I'm paying $200 per month, would Akamai even bother to give me a quote?
And if I'm paying $20,000/yr for anything, then sure, it's good business sense for a medium business to reconsider suppliers at least yearly, even if you're very happy with the current one - just to shop around and verify prices.
You should avoid Akamai at all costs. They treat you like dirt under their fingernails until you have a 6-figure contract. And even then the experience remains nothing short of surreal...
Report a service problem? Welcome to the Indian support center, where you will be slowly spoon-fed the next best canned response that may or may not match your problem. Escalating to anyone with half a clue is near impossible.
Need a quote? Wait weeks at a minimum for someone to get back to you.
Oh, did we break your reporting? Of course this is your fault, Akamai doesn't make mistakes. Now talk to the hand.
If you're in the market for a CDN then there are plenty of candidates who will sell you a better experience for less money (Level3, EdgeCast, BitGravity, etc.).
OTOH if you're paying enough, you get really good service and support (I have a bunch of friends who work there who say exactly what you're saying), so figuring out a way to use them through a reseller is actually a fairly legitimate choice.
I have to wonder how much is "enough" then, our account wasn't that small when we finally ditched Akamai.
Meanwhile none of the other CDNs has given us such a terrible experience. It's almost as if they care, imagine that!
When you consider that the performance differences are at best marginal (the only real differences exist on mobile and in emerging markets, and Akamai is not on top of that) there's really no good reason to put up with the spoiled brats of Akamai in this day and age.
At that $3k/mo level, you can definitely talk to CDNetworks (the CDN I currently use), which is sandwiched between two orders of magnitude of scale, CloudFlare on one side and Akamai on the other. (That said, CDNetworks seems to be much better positioned with regards to China than Akamai.) (That said, I'm actually pretty certain that Akamai would talk to you at the $3k/mo level: have you even tried calling them?)
They'd talk to you if you were a startup and $3k, but probably not so much (directly) if there wasn't growth potential. There are hosting providers who resell Akamai for smaller customers, though. (The only time I ever cared about high-end CDN involved businesses Akamai wouldn't serve, though.) I haven't gone through the normal sales route with them, but I know lots of internal Akamai people in security/ops/etc. and small customers are not really their market.
CDN itself is essentially a commodity; it's not too hard to keep multiple CDNs in rotation. There are probably 20+ big CDNs worth considering and another bunch of resellers. (Amazon CloudFront, BitGravity, Level3, Limelight are probably the first ones I'd think of for smaller sites; Akamai is still the undisputed king for top performance.)
DNS is the thing which is more interesting to me.
I'd probably go with Route 53 for cheap, good anycast DNS right now; everyone else seems to either be a clown or super expensive (or bundled with other expensive DNS services). Ultimately I guess I'll end up doing internal DNS. (Non-anycast DNS is also a total commodity, but good anycast DNS not as much.) DynECT also looks pretty good. Not sure what other anycast DNS providers there are in the <$500/zone/mo range.
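For anyone curious, a minimal sketch of what driving Route 53 looks like with boto3 (the AWS SDK for Python); the zone ID, hostname, and IP are placeholders, and real use needs AWS credentials configured:

    # Upsert an A record in Route 53 via boto3; all identifiers are placeholders.
    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",   # placeholder hosted zone
        ChangeBatch={
            "Comment": "repoint www during an upstream outage",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 300,  # keep this low so a repoint takes effect quickly
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }],
        },
    )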
> There are hosting providers who resell Akamai for smaller customers, though.
There are also many other CDNs that exist in the massive territory between CloudFlare and Akamai (such as CDNetworks, the company I had mentioned).
> CDN itself is essentially a commodity; it's not too hard to keep multiple CDNs in rotation.
For latency-insensitive use cases in generally centralized territory, I agree that CDNs are "essentially a commodity". The correct strategy would seem to be to call a number of them, and negotiate a good deal, not to assume that the one that has a printed sticker price is somehow the right choice (as some people here seem to have been doing ;P).
However, to make the counter-point to this: the cache hit ratio that is being reported by CloudFlare for evasi0n.com (note: I do not have control over that site's hosting; that choice was due to planetbeing and pod2g) is 81%, and this is for a static single-page information site. How various CDNs handle caching, whether they cache you on disk or in RAM, what they do with regards to hot connections or pre-fetching... these all have massive performance implications for your website.
It's a totally reasonable thing for a person who is busy to "satisfice" on many priorities, vs. optimize. Maybe CloudFlare isn't optimal, but if I can get a price and sign up in minutes, and it's good enough, that might be the right choice. It's not just the time; it's that talking to a salesperson is usually psychologically draining. You'll never be able to pick up a phone and get a price in a few minutes; it's always "where is your business located", "x is the rep", "x will call you back", etc. It turns into a fiasco. You end up having CDN sales reps come to your office to meet with you to "understand your requirements". etc.
Punishing "old-school enterprise sales tactics" which try to keep price from being transparent is a reasonable choice. If you're a big content site, yes, you should go through the effort, but for someone who just wants a small service, buy from people who publish their prices.
CloudFlare isn't the only CDN which publishes pricing -- CloudFront with AWS is very transparent. Rackspace Cloudfiles is transparent. BitGravity is fairly transparent. Cachefly. etc.
Akamai is the worst at this, but Level3, CDNetworks, and Limelight don't publish pricing either.
Offering a free service like CF does is the genius of the freemium model -- even if your service is more expensive or less suitable at the high end, people who start out because it's free and easy will often stick with you as long as you're "good enough" as they grow.
I find it interesting that you bring up CloudFront, because they are also very expensive. As far as I can tell, because there are so many people out there who have a mental aversion to talking to another human and negotiating, they can charge an insane premium on an "engh" service.
Regardless, if you can't pick up the phone and negotiate with a CDN (someone whose opinion of you is totally irrelevant and where the worst-case outcome is "we won't do business with you"), how are you going to handle support on your own product, or court investors for your company?
That's irrelevant. A free hobby site might be worth spending $200/mo on CDN for, but I'm not going to drop the $10-30k/mo to get an $AKAM sales rep to call me back.
If you're doing mostly static data which rarely changes, you can probably get very high hit rates on CloudFlare, and it's cheaper than even crappy $750/Gbps/mo colo bandwidth then.
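Rough arithmetic behind that claim, with the $750/Gbps/mo figure from above and an assumed hit ratio and traffic level purely for illustration:

    # Back-of-the-envelope comparison. The $750/Gbps/mo colo figure comes from
    # the comment above; the hit ratio and traffic level are assumed.
    hit_rate = 0.90            # assumed cache hit ratio for mostly-static content
    peak_traffic_gbps = 2.0    # assumed peak outbound traffic
    colo_price_per_gbps = 750  # $/Gbps/month, from the comment above

    colo_only = peak_traffic_gbps * colo_price_per_gbps   # all traffic served from colo
    origin_gbps = peak_traffic_gbps * (1 - hit_rate)      # only cache misses reach origin
    origin_behind_cdn = origin_gbps * colo_price_per_gbps

    print(f"colo only:         ${colo_only:,.0f}/mo")
    print(f"origin behind CDN: ${origin_behind_cdn:,.0f}/mo")
    print(f"headroom for the CDN fee before it stops being a win: "
          f"${colo_only - origin_behind_cdn:,.0f}/mo")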
I'm so happy we didn't jump on that wave. Redirecting your DNS to someone else seems like a bad idea in any case. And what do they do that I could not have done with Varnish?
They are only sort of a CDN; in addition to DNS, they specialize in site optimization via content transformation (a la mod_pagespeed in the cloud) and "DDoS protection" (which is pretty much them replacing your website for new users with some JavaScript that tries to determine if you are a legitimate client).
They don't promise to cache much, and they in fact don't: even on very simple single-page static information websites, such as evasi0n.com, they have an abysmally poor 81% cache hit ratio. They don't help at all with dynamic content due to having poorly located nodes and lots of heavyweight code running in their proxy. Their lack of many nodes in good positions (compared to something like CDNetworks or Akamai, they are one or two orders of magnitude smaller) also means they can't provide very good latency even for the times when they actually happen to have something in cache.
(Note: if someone is now going to say CDNs don't generally do well with dynamic content, they are wrong: normal CDNs actually improve the performance of dynamic content incredibly by maintaining large-window pre-connected HTTP sessions to customer origin servers, often over private networks that already provide better bandwidth: you can easily see 2x latency improvements with a normal CDN even for fully dynamic content).
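(A toy illustration of the pre-connected-session point, not specific to any CDN: the second and later requests on a kept-alive connection skip TCP/TLS setup, which is most of the latency for small dynamic responses. This uses the third-party requests library and a placeholder URL.)

    # Toy demo of connection reuse; example.com is a placeholder,
    # uses the third-party `requests` library.
    import time
    import requests

    URL = "https://example.com/"

    def timed_get(get):
        start = time.monotonic()
        get(URL)
        return (time.monotonic() - start) * 1000  # milliseconds

    # New connection for every request (roughly what a distant client pays):
    cold = [timed_get(requests.get) for _ in range(3)]

    # One persistent session, analogous to an edge node keeping a warm
    # connection open to the origin:
    session = requests.Session()
    warm = [timed_get(session.get) for _ in range(3)]

    print("new connection each time (ms):", [round(t) for t in cold])
    print("reused connection (ms):       ", [round(t) for t in warm])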
So, they really shouldn't be compared with a "CDN": they have an interesting service that actually provides something valuable for many key use cases (4chan comes to my mind: in essence, something that is actually likely to experience a true DDoS attack sufficiently often and with sufficiently little provocation that it makes sense to add an external system to your infrastructure), but if you need a "CDN" there are many more reasonable alternatives that don't have as many moving parts and are thereby going to break much less often (and, if they actually do, should break only in localized regions).
It's Sunday! Seems like they pushed another faulty update (like last time)! Yep, confirmed, all is down including their own site. That's pretty fucked up, when they don't even have off-site status! Good thing I don't use CloudFlare on all my sites...
I'm getting alternating 504 and 502 errors for the main website. The server signature is "cloudflare-nginx", so it's definitely getting through, though.
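(If you want to check this yourself, the Server header tells you whether a request is at least reaching their edge; a standard-library sketch with a placeholder URL:)

    # Does an error come from the CloudFlare edge or never get that far?
    # The URL is a placeholder; standard library only.
    import urllib.request
    import urllib.error

    URL = "https://example.com/"

    try:
        resp = urllib.request.urlopen(URL, timeout=10)
        status, headers = resp.status, resp.headers
    except urllib.error.HTTPError as e:      # 4xx/5xx responses still carry headers
        status, headers = e.code, e.headers

    print(status, headers.get("Server"))
    # A Server value of "cloudflare-nginx" with a 502/504 means the request
    # reached their edge, which then couldn't complete it.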
I have no idea what is wrong, or how long they will take to fix it, but I'd imagine that CloudFlare has significantly better network engineers than the average company, and so they will fix it in far less time than the average company would fix the same problem.
But, for the amount of money many paid customers are paying them (in essence, anyone at that $3k/mo level that includes the critical 24/7 phone support), you can actually get an account with a company like CDNetworks or Akamai (if nothing else, with a reasonable CDN like EdgeCast) and have still-better network engineers than CloudFlare.
Also, even if you are using them for free: they aren't replacing people you have in house... they are an additional component that can independently fail, in addition to any of the things that would have caused your average company's network engineers to fail. They don't promise to cache enough content to replace much of your infrastructure.
“While we have not completed our investigation, we believe this incident was triggered by a product issue that Juniper identified last October, when a patch was also made available"
Good network engineers tend to apply newly released patches. This vulnerability was documented for almost half a year...
- There is a global problem that affects the CloudFlare proxy and DNS services.
- The problem appears to be due to bad routing.
- We are working to restore correct routes in order to bring both DNS and proxy services back online.
- The operations and networking team are all online and treating this as an emergency.
- We do not have an ETA on the response time but will continue to post updates via Twitter as we learn more.
UPDATE. Sites are being restored now. DNS is operating.
I don't know all the details, as bugging the network team while they were fixing it wasn't going to help. We'll get a postmortem blog post up.