Just discovered this problem on my company website, thanks for the update.
--
Seems to be getting better now - intermittent 502s and 504s, with the occasional successful load (speed is choppy, though)
--
Back to near-instant load times, great job guys! I look forward to the blog entry on this. (P.S. It might be worth hosting your status page elsewhere in the future. While this might not happen very often, it's exactly when it does that your status page needs to be working; the lack of redundancy here is startling.)
I'm starting to think it's questionable to depend on CloudFlare more than necessary, but they're still the best option for some things. (I'm a customer, but probably going to stop being a customer this week; I was mostly curious to test it out. Not really decided, though.)
1) The CloudFlare security model for SSL basically lets them MITM all your traffic. Probably not a big deal for SSLizing a normal website, or even for accepting credit cards, since they're a decent-sized US company with legal liability, although I'd be concerned about their internal security vs. your own internal security (since you're still fully exposed on your side, too -- it doesn't improve security, and can at best not be a source of new vulnerability).
2) Their DNS doesn't appear particularly redundant; it's just anycast in one big block. Using CloudFlare for DNS seems to be bad practice; you should use something else and cname to CF. Ideally something with multiple DNS servers either individually anycast or in at least two independent (probably anycast) netblocks.
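For what it's worth, here's a rough sketch of the kind of check being described: pull a zone's NS records and see whether all of the nameserver IPs land in the same /24, which would hint at a single (anycast) netblock rather than independent ones. This assumes the third-party dnspython package, "example.com" is just a placeholder, and the /24 grouping is deliberately crude.

    # Rough sketch: do all of a zone's nameservers sit in one /24?
    # Assumes the third-party dnspython package (pip install dnspython);
    # "example.com" is a placeholder zone.
    import dns.resolver

    def nameserver_prefixes(zone):
        prefixes = set()
        for ns in dns.resolver.resolve(zone, "NS"):
            for a in dns.resolver.resolve(ns.target.to_text(), "A"):
                ip = a.to_text()
                prefixes.add(".".join(ip.split(".")[:3]))  # crude /24 grouping
        return prefixes

    prefixes = nameserver_prefixes("example.com")
    print(prefixes)
    if len(prefixes) < 2:
        print("All nameservers appear to sit in a single /24 -- no netblock diversity.")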
3) Performance of the proxy service seems adequate in my experience, but for sites with large amounts of overseas-source traffic, I've heard of people getting lots of suspected-bad-guy path. For a free forum like 4chan that's probably fine; for an e-commerce site, probably not.
Can you say more about "lots of suspected-bad-guy path"? (I do not use CloudFlare currently and am not intending to do so, but I do run an e-commerce site, and have not heard this term used before: the idea of an entire axis along which I have not been evaluating my infrastructure intrigues me.)
As an ordinary web user, if CloudFlare suspect you may be malicious (your IP address being on a spam blacklist seems to be enough) they will show you a warning that you should scan your PC for viruses and ask you to complete a captcha.
(They linked to a report on a spam blacklist claiming my IP address had been used to send spam during the past week. I'd had my IP address for about twelve hours. Thanks, TalkTalk!)
As a web site operator, you have the option of switching this behaviour off.
I saw their 'bad guy' page many times living in Costa Rica then Nicaragua. I didn't realize what it was until a friend started testing out CF and I got it on his site and then we realized what was going on.
That was over a year ago though, I assume things have improved as they got more and better data (and resources). I was in Costa Rica in January/February then again in August/September last year and didn't come across it I don't think.
I've been in Turkey for the last couple of weeks and haven't seen one at all, and that's including using an EC2 proxy because of the censorship here.
If they think you (as an end user accessing a CF site) are a source of DDoS or other abusive traffic, they push your traffic into a sandbox. You then have to answer a captcha before moving on; I assume the bounce rate is incredibly high among most users at that point.
My feels go out to the ops folk at Cloudflare. Mistakes happen no matter how many years of experience people bring in, or how much they're paid. We're all human, after all. It must be a high-pressure job, being responsible for potentially millions of dollars of losses during this downtime.
I hope the issue is resolved soon and if a person caused it, they're not in too much trouble.
I would hope that they would treat this problem as an opportunity to improve their systems instead of haranguing an employee. All systems fail, the key to success is to turn that failure around and make sure it doesn't happen (the same way) again.
If this was a human error, then change the process to make that error difficult to repeat. If it was a software problem, then they have just found a new set of automated tests to write. If it was a vendor problem, contact the vendor to see how they plan to prevent this problem in the future. If it was a design problem, change the structure of the system to make this type of problem less likely.
Quite a few big names have had similar outages in the past but the ones that I forgive (Google, Amazon) are the ones that talk openly about the issue and outline the changes that they are making to fix them. I'm looking forward to CloudFlare's explanation.
BTW: My boss asked me this week for more information on Railgun and if we should change CDN providers. Railgun sounds incredible and I think it could really help our platform. How CloudFlare responds to this incident is critical to my decision to move forward with Railgun testing or to just forget it completely.
Something I always wondered - do you use Cloudflare simply as a CDN for photos, or does 4chan also frequently become the target of DDoS and other external attacks?
We use CloudFlare for everything -- CDN, DDoS mitigation, Railgun (see below), et cetera. They serve ~1.5 petabytes of bandwidth on our behalf every month and proxy billions upon billions of requests. I'm a huge fan of the product and team, despite hiccups like this one.
Edit -- I am officially faster than carrier pigeon:
me: was i the first human to notify you guys?
me: i caught it within the first 30-60 seconds i think
me: because i have no life and never sleep
CloudFlare Ops pal: yeah. you did.
If anyone wants to hire me to check their site instead of Pingdom, feel free to ping me!
It's expected that eventually things will go down; I've never once found a piece of software, server, person, or traffic light that has performed at 100% for its entire lifetime. The best thing we can take from this, however, is the response time.
Barely an hour from outage to being live again - I barely had time to fire an email to my colleagues and update our Twitter before it came back online. What's worrying here, though, is that there was no fallback in place whatsoever. Hopefully we will be hearing from the Cloudflare team soon to let us know what went wrong, but if they are half the company they purport to be, they will learn a lot from whatever went wrong today.
Human error is unacceptable. Online services should never have a single point of failure which can take their entire global presence offline -- not only because avoiding one minimizes the risk of catastrophic accidents, but also because hardware does occasionally break.
This is why every piece of kit should be mirrored with a redundant backup, and why many businesses even have entire duplicated standby systems for such disasters. Even if that gear cannot support the entire infrastructure, it's usually enough to at least publish an official status page. Having to use Twitter to update users is just amateurish in my opinion.
No systems are immune to failure. No matter how much redundancy you have, chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.
It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.
But the number of possible failure scenarios we don't understand properly is pretty much infinite.
Losing DNS was fairly unforgivable. DNS as a protocol is designed to make it easy to deal with server and network outages (to the point of losing netblocks from the global routing tables). They added anycast DNS, which is great, but didn't split their DNS into multiple anycast netblocks.
I run a number of DNS servers myself. And yes, it was "fairly unforgivable" to mess that part up.
But what I was addressing was the blanket claim that human error is unacceptable. Anyone who runs a setup much larger than a calculator will deal with human errors - whether actual operational errors or human inability to engineer for resilience against all possible but unlikely scenarios - on a regular basis.
Some things should be harder than others to break, and DNS is amongst them. Cloudflare no doubt have plenty of lessons to learn. But so does everyone else.
> I run a number of DNS servers myself. And yes, it was "fairly unforgivable" to mess that part up.
So then you basically agree with my point <_<
> But what I was addressing was the blanket claim that human error is unacceptable. Anyone who runs a setup much larger than a calculator will deal with human errors - whether actual operational errors or human inability to engineer for resilience against all possible but unlikely scenarios - on a regular basis.
You're twisting my words and taking them out of context. I was saying that human error is an unacceptable excuse for the entire stack of a company the size of Cloudflare going offline. And I was saying that because redundancy systems should act as a "safety net" so that administrators can make human errors. I've lost count of the number of dumb mistakes I've made over the years, but each time I've been able to switch to a backup system while I worked towards undoing my cock-up. And you said yourself that a complete DNS outage was unacceptable, so clearly you and I are more or less on the same page regarding this.
This response indicates lack of responsibility to me.
Human failure is unacceptable when we know it happens, and we should do everything possible to guard against it. You say "guess at some we don't". Well, we do know all about human failure.
All that sympathy for that bloke who "fired himself", because everyone here agreed that his potential human error should have been anticipated, but not so in this case? Seems to me we are applying different standards.
Most of all, what I don't like is universal get-outs. It reminds me of the worst lie of all: "Sorry Sir, it's a computer error".
> You say "guess at some we don't". Well, we do know all about human failure.
You miss the point. We can continue to merely enumerate possible error scenarios until the heat death of the universe, and we will still miss some.
It is "human error" for an operations team to not put in place methods for ensuring their systems stay up within agreed parameters.
But the reality is that it is not even theoretically possible for us to engineer a system which can guarantee no downtime. Furthermore, no organization is willing to pay the bill to address more than a relatively small fraction of the problems we can easily predict, because many even relatively likely failures are more expensive to protect against than they are worth.
So to begin with, we can't prevent failure. And even if we could, what from the outside looks like human error is often internally a result of either intentional budgetary constraints, or unintended consequences of lack of resources.
It is not about lack of responsibility. It is about dispelling the fantasy that there is someone who is guilty of not doing their job correctly behind every failure.
That is not to say that there might not have been unacceptable human errors in this specific case. But that is entirely beside the point.
> but not so in this case?
I thought it was pretty clear that my comment applied to the general statement that "human error is unacceptable", but perhaps not. I explicitly wrote "It's very well possible that Cloudflare messed up here" because I didn't want to speculate on the specific problems in this case before the causes were even known.
> Most of all, what I dont like is universal get outs.
They are not "universal get-outs". Nobody is going to say it is acceptable if the error is caused by someone bringing coffee into the ops room and spilling it all over the single server, for example. But there is a vast range between someone who is grossly negligent and/or incompetent and who should bear the blame, and someone who is doing their job as well as is reasonable given the resources available to them, but who still makes mistakes or oversights, or simply doesn't have the time or resources to address some reasonably unlikely issue that eventually happens to cause downtime.
> This response indicates lack of experience to me.
I could say the same in return. I've worked in a few data centres and they've all put redundancy at the forefront of their design. (even to the point of having multiple physical fibre paths for fail over).
> No systems are immune to failure. No matter how much redundancy you have
You have that backwards. No systems are immune to failure, which is why you have redundancy.
> chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.
If the dependencies haven't been anticipated then someone isn't doing their job right. There's a reason why incident response / disaster recovery / business continuity plans are written. It should be someone's job to think up every "what if" scenario, ranging from each and every bit of kit dying, to all your staff winning the lottery and walking out the next day, to even terrorist attacks. I've even had to account for what would happen if nukes were dropped on the city where our main data center was housed (though the answer to that was a simple one: nobody would care that our site went offline). It might sound clichéd, but people get paid to expect the unexpected and work out how to maintain business continuity.
> It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.
This was their infrastructure failing. If you own and maintain the infrastructure then you have no excuse not to work out what might happen if each and every part of that infrastructure failed. (trust me, I have had to do this in my last two jobs - despite your accusations of my "lack of experience" ;) ).
> But the number of possible failure scenarios we don't understand properly is pretty much infinite.
You're confusing cause and effect. The number of different causes for failure is infinite. But the effect is finite. For example: a server could crash for any number of reasons (hardware, software, user error, and all the different ways within those categories), but the end result is the same: the server has crashed. So what you do is plan for situations where different services fail (staff don't turn up for work, your domain name service stops responding, etc.) and build some kind of redundancy around that, giving engineers a little more breathing time to fix the issue with the minimum possible disruption to your users. As Cloudflare had to resort to Twitter to update their users, they completely failed every possible aspect of such planning. And given the high-profile sites that depend on Cloudflare, they have no excuses.
If this happened in any of the other companies I worked for, I'd genuinely be fearful for my job, as a crash of that magnitude would mean that I hadn't done my job properly.
As someone who hosts hundreds of PAID sites with CloudFlare this is pretty unacceptable. I'm giving them thousands of dollars so that this doesn't happen. Will probably be moving off unless they have some very good reasoning behind a world-wide shutdown of a geo-redundant service...
Fair enough. But then what is the point of paying for enterprise support if all you get is the same generic response a normal punter gets? If they can't offer better, then enterprise support seems poor value at best.
Are we sure that if CloudFlare wasn't so popular here, the word scam wouldn't be used?
So you're saying you want them to say the same thing but reword it since you're enterprise? If they don't know the root cause, they don't know the root cause.
My startup NameTerrific can support instantaneous DNS updates in a geo-redundant Anycast infrastructure. As long as your TTL is sufficiently low (<300), the impact is quite limited as propagation time is negligible at NameTerrific.
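(Not specific to any one provider, but a quick way to see what TTL resolvers are actually being handed for a record, which bounds how long a stale answer can linger after a repoint; this assumes the third-party dnspython package and a placeholder hostname:)

    # Check the TTL currently served for a record.
    # Assumes dnspython; www.example.com is a placeholder.
    import dns.resolver

    answer = dns.resolver.resolve("www.example.com", "A")
    print("TTL:", answer.rrset.ttl, "seconds")
    # With a TTL under 300, a repointed record should be picked up by most
    # resolvers within about five minutes of the change.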
EDIT: Sorry guys. We got some issues with a gem after installing the recently updated ruby2.0.0p0. The unicorn workers were timing out. TerrificDNS is completely unaffected and the site is already running again.
Well, we have already soft launched our own TerrificDNS Anycast and it has replaced the Route 53 solution. TerrificDNS platform is running on Redis + PowerDNS.
Would love some details on how to do this, I've been a (business) CF customer for a while now and have never seen this service and I haven't turned up anything googling/scouring their website for it just now.
Yep, this is definitely something I'll do in the future, and I guess it was my fault for trusting a single service... even one that is designed around keeping your site up and redundant.
In my experience, cloudflare has been little more than a scam for anyone with half decent traffic.
Not really surprised. Funny how the status page shows all green (are those just static buttons?) while they acknowledge there is an issue and that they don't know what's going on.
"2500% guarantee
This extended Service Level Agreement guarantees 100% uptime, and adds a multiplier to owed service credits resulting from any lapse: 5 times any downtime minutes and 5 times customers affected = 2500% guarantee."
I wish there were a business model for a third-party site-monitor and site-uptime service, which lets the site owner do more than just post updates, but also prevents the site owner from lying about historical data.
Basically New Relic (that actually worked) + Pingdom + Internet Archive + Twitter + status.example.com.
Doesn't really matter -- as long as outages at the two sites are independent, and relatively unlikely, they usually won't happen at the same time. You can make the status service pretty reliable, cheaply, compared to most other services, and if it loses its central servers, the remote monitoring nodes can still store test results, so when the service comes back up, the historical record should be accurate.
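A minimal sketch of that buffering idea, under the assumption of a simple probe-plus-collector design (all URLs and the file name below are placeholders; standard library only): each remote probe appends results locally and only ships them upstream when the central collector answers, so the history survives a central outage.

    # Sketch of a remote probe that keeps its own log and uploads it when the
    # central collector is reachable; URLs and file name are placeholders.
    import json, time, urllib.request, urllib.error

    TARGET = "https://example.com/"                  # site being monitored
    COLLECTOR = "https://collector.example/ingest"   # central service
    BUFFER_FILE = "probe_buffer.jsonl"

    def check():
        start = time.time()
        try:
            status = urllib.request.urlopen(TARGET, timeout=10).status
        except Exception:
            status = None                            # unreachable or hard failure
        return {"ts": start, "status": status, "latency": time.time() - start}

    def flush_buffer():
        try:
            with open(BUFFER_FILE) as f:
                payload = f.read().encode()
            req = urllib.request.Request(COLLECTOR, data=payload,
                                         headers={"Content-Type": "application/x-ndjson"})
            urllib.request.urlopen(req, timeout=10)
            open(BUFFER_FILE, "w").close()           # clear only after a successful upload
        except (OSError, urllib.error.URLError):
            pass                                     # keep buffered results for next round

    while True:
        with open(BUFFER_FILE, "a") as f:
            f.write(json.dumps(check()) + "\n")
        flush_buffer()
        time.sleep(60)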
I use a pretty major forum that has a huge amount of traffic. The owner migrated it to CloudFlare. For the past 5-6 weeks, 50% of the site's requests have gone to a 'Sorry, xyz is not available right now' page.
You should make sure that the site is not actually returning 500s for those requests. We had some similar issues when we first started using the service.
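One crude way to check that (the URL below is a placeholder; standard library only) is to hit the site repeatedly and tally the status codes, so you can see whether the "not available" page lines up with genuine 5xx responses:

    # Tally status codes over a batch of requests; the URL is a placeholder.
    import collections
    import urllib.request, urllib.error

    URL = "https://forum.example.com/"
    counts = collections.Counter()

    for _ in range(20):
        try:
            counts[urllib.request.urlopen(URL, timeout=10).status] += 1
        except urllib.error.HTTPError as e:
            counts[e.code] += 1
        except urllib.error.URLError:
            counts["network error"] += 1

    print(counts)  # e.g. Counter({200: 11, 502: 9}) would match the ~50% figure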
Somebody pointed out the CNAME setup available on CloudFlare. I never knew that, so I checked out the article. The first paragraph:
"CNAME setup is a manual process generally available to paid CloudFlare plans only. If you are interested in testing CNAME setup, please contact CloudFlare first with the domain you would like to test CNAME with. Please specifically mention CNAME Setup in the subject field for faster review. Allowing for CNAME setup is entirely at the discretion of CloudFlare."
So NO: this isn't even a feature at all. They made it as hard as possible to set this up and will grant you the use of it as they like.
It seems that CloudFlare's DNS is down, and affecting NameTerrific as we have a CNAME record pointing to them. I had to change the CNAME record to get our site working again.
EDIT: Based on Twitter search, all CloudFlare sites seem to be down.
That part surprised me. I thought it was common for ISPs to not use their own services for critical web presence, i.e. places users might visit when it's down.
Even their status page is down. And sure - all my sites too. And the funny part is that for most of my sites I have turned off the CloudFlare features and use just their DNS. I never thought I would go down because of DNS not being available.
btw - sites are back. I hope they will not go down again.
Why have I switched off their services? Because once I enabled them, sites got slower, not faster. Sure, I doubt that most users noticed, but if I checked Pingdom or Google Crawl Stats, the 'Time spent downloading a page' situation was very clear: with CloudFlare it took Google 2x more time to download a page than without. I had no time to investigate why that's so, but after switching CloudFlare off I was back to 500ms again.
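If anyone wants to reproduce that kind of comparison themselves, here's a crude sketch: time full-page downloads through the proxied hostname and directly against the origin IP with a Host header. The hostname and IP are placeholders, and plain HTTP is used to keep the example simple.

    # Crude timing comparison: proxied hostname vs. direct-to-origin.
    # www.example.com and 203.0.113.10 are placeholders; plain HTTP for simplicity.
    import time
    import urllib.request

    def fetch_ms(url, host_header=None):
        req = urllib.request.Request(url)
        if host_header:
            req.add_header("Host", host_header)
        start = time.monotonic()
        urllib.request.urlopen(req, timeout=30).read()
        return (time.monotonic() - start) * 1000

    proxied = [fetch_ms("http://www.example.com/") for _ in range(5)]
    direct = [fetch_ms("http://203.0.113.10/", host_header="www.example.com")
              for _ in range(5)]

    print("via proxy (ms):       ", [round(t) for t in proxied])
    print("direct to origin (ms):", [round(t) for t in direct])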
Edit: Sites are not back. I guess the ones that work are the ones for which I have DNS cached. Let's wait...
When I briefly did that test, I didn't find much difference on Pingdom; but I like some of their other services and I assume they can handle burst activity at least as well as my own server. So I'm a big fan overall and keep using them.
Akamai isn't as accessible as Cloudflare. For one example I cannot even see how much they charge for similar services, I'd likely have to get a "quote" which means that the whole thing is out of budget for most small-medium businesses.
Akamai's whole web presence is clearly aimed at enterprise-class customers. I mean, there are whitepapers everywhere, the site is filled with business nonsense, and I cannot even get started without at least a phone call or five.
Maybe people "jump on" to new companies because those companies actually offer a product old companies are not? I can go to Cloudflare right now and sign up, I can give them my CC info and pay, and everything is up-front and easy. I can be online with Cloudflare within the hour (or whenever DNS moves).
I have been using CDNs since I was a very small business (CacheFly, EdgeCast, and now CDNetworks), and if you are seriously using CloudFlare because you haven't even tried setting up a phone call with Akamai due to some kind of mental aversion, you are doing your business a disservice. (I mean, seriously: is the 20 minutes you might waste really that valuable? What's the harm... it might even go well!)
Service sales 101 - if a business model contains manually/individually quoting every potential customer, then your prices cannot possibly be reasonably low, as your sales model requires enterprise-level prices to pay for your salesmen.
It's not that I can't spend 20 minutes on a call - but if your customer acquisition costs involve at least 20 minutes of salesman-time per possible lead (so, at say one sign-up for every ten leads, at least 200 minutes of salesman-time for even the smallest sign-up), then I'm very sure that whatever number they quote will not be something that I can or want to afford. Let them pester some 'enterprises' with that.
Again, I have been using CDNs for a while now, including one that is much larger than CloudFlare (CDNetworks), and on various occasions I have talked with Akamai to get quotes. In my book, CloudFlare reams medium-sized companies on price, and they can do that precisely because people such as yourself have miscalibrated your "I'm getting screwed" detector. (The same is true of Amazon CloudFront, btw.) In essence, once you realize people have this algorithm (specifically, where they don't even bother calling a sales person) it is trivial to exploit you by quoting a large price that you will never do comparison on (and then, rather than compete on price, you end up competing on a ton of questionable value-adds, and do price discrimination on things like 24/7 support).
Well, would I even notice if I were getting any invoices worth haggling over? If I'm paying $200 per month, would Akamai even bother to give me a quote?
And if I'm paying $20,000/yr for anything, then sure, it's good business sense for a medium business to reconsider suppliers at least yearly, even if you're very happy with the current one - just to shop around and verify prices.
You should avoid Akamai at all costs. They treat you like dirt under their fingernails until you have a 6-figure contract. And even then the experience remains nothing short of surreal...
Report a service problem? Welcome to the Indian support center, where you will be slowly spoon-fed the next best canned response that may or may not match your problem. Escalating to anyone with half a clue is near impossible.
Need a quote? Wait weeks at a minimum for someone to get back to you.
Oh, did we break your reporting? Of course this is your fault, Akamai doesn't make mistakes. Now talk to the hand.
If you're in the market for a CDN then there are plenty of candidates who will sell you a better experience for less money (Level3, EdgeCast, BitGravity, etc.).
OTOH if you're paying enough, you get really good service and support (I have a bunch of friends who work there who say exactly what you're saying), so figuring out a way to use them through a reseller is actually a fairly legitimate choice.
I have to wonder how much is "enough" then, our account wasn't that small when we finally ditched Akamai.
Meanwhile none of the other CDNs has given us such a terrible experience. It's almost as if they care, imagine that!
When you consider that the performance differences are at best marginal (the only real differences exist on mobile and in emerging markets, and Akamai is not on top of that) there's really no good reason to put up with the spoiled brats of Akamai in this day and age.
At that $3k/mo level, you can definitely talk to CDNetworks (the CDN I currently use), which is sandwiched between two orders of magnitude of scale, CloudFlare on one side and Akamai on the other. (That said, CDNetworks seems to be much better positioned with regards to China than Akamai.) (That said, I'm actually pretty certain that Akamai would talk to you at the $3k/mo level: have you even tried calling them?)
They'd talk to you if you were a startup and $3k, but probably not so much (directly) if there wasn't growth potential. There are hosting providers who resell Akamai for smaller customers, though. (The only time I ever cared about high-end CDN involved businesses Akamai wouldn't serve, though.) I haven't gone through the normal sales route with them, but I know lots of internal Akamai people in security/ops/etc. and small customers are not really their market.
CDN itself is essentially a commodity; it's not too hard to keep multiple CDNs in rotation. There are probably 20+ big CDNs worth considering and another bunch of resellers. (Amazon CloudFront, BitGravity, Level3, Limelight are probably the first ones I'd think of for smaller sites; Akamai is still the undisputed king for top performance.)
DNS is the thing which is more interesting to me.
I'd probably go with Route 53 for cheap, good anycast DNS right now; everyone else seems to either be a clown or super expensive (or bundled with other expensive DNS services). Ultimately I guess I'll end up doing internal DNS. (Non-anycast DNS is also a total commodity, but good anycast DNS not as much.) DynECT also looks pretty good. Not sure what other anycast DNS providers there are in the <$500/zone/mo range.
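For anyone curious, a minimal sketch of what driving Route 53 looks like with boto3 (the AWS SDK for Python); the zone ID, hostname, and IP are placeholders, and real use needs AWS credentials configured:

    # Upsert an A record in Route 53 via boto3; all identifiers are placeholders.
    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",   # placeholder hosted zone
        ChangeBatch={
            "Comment": "repoint www during an upstream outage",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 300,  # keep this low so a repoint takes effect quickly
                    "ResourceRecords": [{"Value": "203.0.113.10"}],
                },
            }],
        },
    )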
> There are hosting providers who resell Akamai for smaller customers, though.
There are also many other CDNs that exist in the massive territory between CloudFlare and Akamai (such as CDNetworks, the company I had mentioned).
> CDN itself is essentially a commodity; it's not too hard to keep multiple CDNs in rotation.
For latency-insensitive use cases in generally centralized territory, I agree that CDNs are "essentially a commodity". The correct strategy would seem to be to call a number of them, and negotiate a good deal, not to assume that the one that has a printed sticker price is somehow the right choice (as some people here seem to have been doing ;P).
However, to make the counter-point to this: the cache hit ratio that is being reported by CloudFlare for evasi0n.com (note: I do not have control over that site's hosting; that choice was due to planetbeing and pod2g) is 81%, and this is for a static single-page information site. How various CDNs handle caching, whether they cache you on disk or in RAM, what they do with regards to hot connections or pre-fetching... these all have massive performance implications for your website.
It's a totally reasonable thing for a person who is busy to "satisfice" on many priorities, vs. optimize. Maybe CloudFlare isn't optimal, but if I can get a price and sign up in minutes, and it's good enough, that might be the right choice. It's not just the time; it's that talking to a salesperson is usually psychologically draining. You'll never be able to pick up a phone and get a price in a few minutes; it's always "where is your business located", "x is the rep", "x will call you back", etc. It turns into a fiasco. You end up having CDN sales reps come to your office to meet with you to "understand your requirements". etc.
Punishing "old-school enterprise sales tactics" which try to keep price from being transparent is a reasonable choice. If you're a big content site, yes, you should go through the effort, but for someone who just wants a small service, buy from people who publish their prices.
CloudFlare isn't the only CDN which publishes pricing -- CloudFront with AWS is very transparent. Rackspace Cloudfiles is transparent. BitGravity is fairly transparent. Cachefly. etc.
Akamai is the worst at this, but Level3, CDNetworks, and Limelight don't publish pricing either.
Offering a free service like CF does is the genius of the freemium model -- even if your service is more expensive or less suitable at the high end, people who start out because it's free and easy will often stick with you as long as you're "good enough" as they grow.
I find it interesting that you bring up CloudFront, because they are also very expensive. As far as I can tell, because there are so many people out there who have a mental aversion to talking to another human and negotiating, they can charge an insane premium on an "engh" service.
Regardless, if you can't pick up the phone and negotiate with a CDN (someone whose opinion of you is totally irrelevant and where the worst-case outcome is "we won't do business with you"), how are you going to handle support on your own product, or court investors for your company?
That's irrelevant. A free hobby site might be worth spending $200/mo on CDN for, but I'm not going to drop the $10-30k/mo to get an $AKAM sales rep to call me back.
If you're doing mostly static data which rarely changes, you can probably get very high hit rates on CloudFlare, and it's cheaper than even crappy $750/Gbps/mo colo bandwidth then.
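Rough arithmetic behind that claim, with the $750/Gbps/mo figure from above and an assumed hit ratio and traffic level purely for illustration:

    # Back-of-the-envelope comparison. The $750/Gbps/mo colo figure comes from
    # the comment above; the hit ratio and traffic level are assumed.
    hit_rate = 0.90            # assumed cache hit ratio for mostly-static content
    peak_traffic_gbps = 2.0    # assumed peak outbound traffic
    colo_price_per_gbps = 750  # $/Gbps/month, from the comment above

    colo_only = peak_traffic_gbps * colo_price_per_gbps   # all traffic served from colo
    origin_gbps = peak_traffic_gbps * (1 - hit_rate)      # only cache misses reach origin
    origin_behind_cdn = origin_gbps * colo_price_per_gbps

    print(f"colo only:         ${colo_only:,.0f}/mo")
    print(f"origin behind CDN: ${origin_behind_cdn:,.0f}/mo")
    print(f"headroom for the CDN fee before it stops being a win: "
          f"${colo_only - origin_behind_cdn:,.0f}/mo")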
I'm so happy we didn't jump on that wave. Redirecting your DNS to someone else seems like a bad idea in any case. And what do they do that I could not have done with Varnish?
They are only sort of a CDN; in addition to DNS, they specialize in site optimization via content transformation (a la mod_pagespeed in the cloud) and "DDoS protection" (which is pretty much them replacing your website for new users with some JavaScript that tries to determine if you are a legitimate client).
They don't promise to cache much, and they in fact don't: even on very simple single-page static information websites, such as evasi0n.com, they have an abysmally poor 81% cache hit ratio. They don't help at all with dynamic content due to having poorly located nodes and lots of heavyweight code running in their proxy. Their lack of many nodes in good positions (compared to something like CDNetworks or Akamai, they are one or two orders of magnitude smaller) also means they can't provide very good latency even for the times when they actually happen to have something in cache.
(Note: if someone is now going to say CDNs don't generally do well with dynamic content, they are wrong: normal CDNs actually improve the performance of dynamic content incredibly by maintaining large-window pre-connected HTTP sessions to customer origin servers, often over private networks that already provide better bandwidth: you can easily see 2x latency improvements with a normal CDN even for fully dynamic content).
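(A toy illustration of the pre-connected-session point, not specific to any CDN: the second and later requests on a kept-alive connection skip TCP/TLS setup, which is most of the latency for small dynamic responses. This uses the third-party requests library and a placeholder URL.)

    # Toy demo of connection reuse; example.com is a placeholder,
    # uses the third-party `requests` library.
    import time
    import requests

    URL = "https://example.com/"

    def timed_get(get):
        start = time.monotonic()
        get(URL)
        return (time.monotonic() - start) * 1000  # milliseconds

    # New connection for every request (roughly what a distant client pays):
    cold = [timed_get(requests.get) for _ in range(3)]

    # One persistent session, analogous to an edge node keeping a warm
    # connection open to the origin:
    session = requests.Session()
    warm = [timed_get(session.get) for _ in range(3)]

    print("new connection each time (ms):", [round(t) for t in cold])
    print("reused connection (ms):       ", [round(t) for t in warm])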
So, they really shouldn't be compared with a "CDN": they have an interesting service that actually provides something valuable for many key use cases (4chan comes to my mind: in essence, something that is actually likely to experience a true DDoS attack sufficiently often and with sufficiently little provocation that it makes sense to add an external system to your infrastructure), but if you need a "CDN" there are many more reasonable alternatives that don't have as many moving parts and are thereby going to break much less often (and, if they actually do, should break only in localized regions).
It's Sunday! Seems like they pushed another faulty update (like last time)! Yep, confirmed, all is down including their own site. That's pretty fucked up, when they don't even have off-site status! Good thing I don't use CloudFlare on all my sites...
I'm getting alternating 504 and 502 errors for the main website. The server signature is "cloudflare-nginx", so it's definitely getting through, though.
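(If you want to check this yourself, the Server header tells you whether a request is at least reaching their edge; a standard-library sketch with a placeholder URL:)

    # Does an error come from the CloudFlare edge or never get that far?
    # The URL is a placeholder; standard library only.
    import urllib.request
    import urllib.error

    URL = "https://example.com/"

    try:
        resp = urllib.request.urlopen(URL, timeout=10)
        status, headers = resp.status, resp.headers
    except urllib.error.HTTPError as e:      # 4xx/5xx responses still carry headers
        status, headers = e.code, e.headers

    print(status, headers.get("Server"))
    # A Server value of "cloudflare-nginx" with a 502/504 means the request
    # reached their edge, which then couldn't complete it.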
I have no idea what is wrong, or how long they will take to fix it, but I'd imagine that CloudFlare has significantly better network engineers than the average company, and so they will fix it in far less time than the average company would fix the same problem.
But, for the amount of money many paid customers are paying them (in essence, anyone at that $3k/mo level that includes the critical 24/7 phone support), you can actually get an account with a company like CDNetworks or Akamai (if nothing else, with a reasonable CDN like EdgeCast) and have still-better network engineers than CloudFlare.
Also, even if you are using them for free: they aren't replacing people you have in house... they are an additional component that can independently fail, in addition to any of the things that would have caused your average company's network engineers to fail. They don't promise to cache enough content to replace much of your infrastructure.
“While we have not completed our investigation, we believe this incident was triggered by a product issue that Juniper identified last October, when a patch was also made available"
Good network engineers tend to apply newly released patches. This vulnerability was documented for almost half a year...
- There is a global problem that affects the CloudFlare proxy and DNS services.
- The problem appears to be due to bad routing.
- We are working to restore correct routes in order to bring both DNS and proxy services back online.
- The operations and networking team are all online and treating this as an emergency.
- We do not have an ETA on the response time but will continue to post updates via Twitter as we learn more.
UPDATE. Sites are being restored now. DNS is operating.
I don't know all the details, as bugging the network team while they were fixing it wasn't going to help. We'll get a postmortem blog post up.