Appreciate the detail here. It's a great writeup. Wondering what folks think about one of the changes:
5. Changing the SOP to do staged rollouts of rules in
the same manner used for other software at Cloudflare
while retaining the ability to do emergency global
deployment for active attacks.
One concern I'd have is whether or not I'm exercising the global rollout procedure often enough to be confident it works when it's needed. Of the hundreds of WAF rule changes rolled out every month, how many are global emergencies?
It's a fact of managing process that branches are a liability and the hot path is the thing that will have the highest level of reliability. I wonder if anyone there has concerns about diluting the rapid response path (the one with the highest associated risk) by making this process change.
Yep, that's the exact bullet point I was writing a response on. Security and abuse are of course special little snowflakes, with configs that need to be pushed very fast, contrary to all best practices for safe deployments of globally distributed systems. An anti-abuse rule that takes three days to roll out might as well not exist.
The only way this makes sense is if they mean that there'll be a staged rollout of some sort, but it won't be the same process as for the rest of their software. I.e. for this purpose you need much faster staging just due to the problem domain, but even a 10 minute canary should provide meaningful push safety against this kind of catastrophic meltdown. And the emergency process is something you'll use once every five years.
Your response suggests a good way to mitigate the risk I was trying to highlight in mine.
They want to have a rapid response path (little to no delay using staging envs) to respond to emergencies. The old SOP allowed all releases to use the emergency path. By not using it in the SOP anymore, I'd be concerned that it would break silently from some other refactor or change.
Your notion is to maintain the emergency rollout as a relaxation of the new SOP such that the time in staging is reduced to almost nothing. That sounds like a good idea since it avoids maintaining two processes and having greater risk of breakage. So, same logic but using different thresholds versus two independent processes.
Right. The emergency path is either something you end up using always, or something you use so rarely that it gets eaten by bit-rot before it ever gets used[0]. So I think we're in full agreement on your original point. This was just an attempt to parse a working policy out of that bullet point.
[0] My favorite example of this had somebody accidentally trigger an ancient emergency config push procedure. It worked, made a (pre-canned) global configuration change that broke everything. Since the change was made via this non-standard and obsolete method, rolling it back took ages. Now, in theory it should have been trivial. But in practice, in the years since the functionality had been written (and never used), somehow all humans had lost the rights to override the emergency system.
My personal rule is that any code which doesn't get exercised at least weekly is untrustworthy. I once inherited a codebase with a heavy, custom blue-green deploy system (it made sense for the original authors). While we deployed about once a week, we set up CI to test the deployment every day.
> Security and abuse are of course special little snowflakes, with configs that need to be pushed very fast, contrary to all best practices for safe deployments of globally distributed systems.
Once upon a time, I worked on a system where many values that would otherwise be statically defined in similar systems were instead put into a database table. This particular system didn't have a proper testing and deployment pipeline set up, so whereas a normal system would just change the static value at some hard-coded point in the code and quickly roll it out, this system needed to keep it in the database so that it would be changeable in between manual deployments (months or even years apart). The ability to change user-facing values by editing the database inflated the time it took to test a release, which in turn lengthened the time it took to release a new version, but well, it worked.
My point is that if security and abuse rules need to be rolled out quickly, then the entire range of security and abuse configurations (i.e. their types) needs to be a testable part of the original pipeline. Then the configurations can safely be changed on the fly, so long as the changes type-check.
It's easy to understand why it's never been built though - you'd need both a security background and a Haskell-ish/type-theory kind of background. Best of luck finding people like that.
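For what it's worth, here's a rough sketch of what I mean by type-checked configurations, in Python purely for illustration (Cloudflare's stack is Lua/PCRE, and every name below is invented): rules are data drawn from a closed set of primitives the pipeline has already tested, and a new rule only ships if it validates against that schema.

    from dataclasses import dataclass
    from enum import Enum
    from typing import List

    class MatchKind(Enum):
        CONTAINS = "contains"   # literal substring
        EQUALS = "equals"       # exact value
        PREFIX = "prefix"       # starts-with

    @dataclass(frozen=True)
    class Predicate:
        field: str              # e.g. "uri", "query", "header:user-agent"
        kind: MatchKind
        value: str

    @dataclass(frozen=True)
    class Rule:
        rule_id: str
        action: str             # "block", "log", "challenge"
        predicates: List[Predicate]

    ALLOWED_FIELDS = {"uri", "query", "body", "header:user-agent"}
    ALLOWED_ACTIONS = {"block", "log", "challenge"}

    def validate(rule: Rule) -> None:
        # Reject anything outside the configuration space the pipeline tests.
        if rule.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {rule.action}")
        for p in rule.predicates:
            if p.field not in ALLOWED_FIELDS:
                raise ValueError(f"unknown field: {p.field}")

    # A new anti-abuse rule can ship on the fly as long as it "type-checks":
    validate(Rule("xss-probe-1", "block",
                  [Predicate("query", MatchKind.CONTAINS, "=alert(")]))

Because the only things a rule author can write are combinations the pipeline has already exercised, a fast config push can't introduce untested behaviour the way a free-form regex can.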
The main problem is that their Regex library doesn't have a recursion limit. I'm honestly amazed they've been able to scale Lua scripts to the point they can use it as a global WAF. Knowing this, it may be easy to create attacks against their filters.
My takeaway is that it's time to move to a custom solution using a more flexible language. A simple async watchdog on total rule execution time would have prevented this. When running tons of Regex rules, I'm amazed they didn't have this.
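To make the watchdog idea concrete, here's a minimal sketch (Python as a stand-in, names made up): give the whole rule pass a per-request time budget and bail out when it's exceeded. A cooperative check between rules only catches a slow rule set; a single runaway regex mid-match still needs an engine-level limit (e.g. PCRE's match/recursion limits).

    import re
    import time

    # NOTE: this catches a slow rule *pass*; a single runaway regex still
    # needs an engine-level match limit, since it can't be interrupted here.
    RULES = [re.compile(p) for p in (r"union\s+select", r"<script", r"\.\./\.\./")]
    TIME_BUDGET = 0.005   # seconds per request for the whole rule pass

    def evaluate(request_text: str) -> str:
        start = time.monotonic()
        for rule in RULES:
            if time.monotonic() - start > TIME_BUDGET:
                return "watchdog-tripped"   # fail open or closed, per policy
            if rule.search(request_text):
                return "blocked"
        return "allowed"

    print(evaluate("GET /search?q=union select password from users"))  # blocked
    print(evaluate("GET /index.html"))                                  # allowed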
I am wondering why you are being downvoted. This outage could have been prevented with better deployment procedures too.
For example my company (nowhere near the scale of Cloudflare) does progressive deployments. New code is deployed only to a handful of machines first, and then, as the hours pass and checks remain green, it propagates to the rest of the server fleet. Full deployment takes 24 hours. We haven't had code-breaking changes in production in the past 3 years. And before that, us breaking things was the most common cause of production issues. Of course that's not the only thing we do: good test practices, code reviews, etc.
The second thing, is separation of monitoring and production. If production going down takes down the monitoring systems too, you will have a very hard time figuring out what's wrong. Cloudflare says "We had difficulty accessing our own systems because of the outage". That sounds very bad.
I'd wager there are many wrong things at play here other than "regex is hard". But I guess HN loves Cloudflare way too much to ask the hard questions.
Yeah, they get some points for admitting WAF rule updates bypass canary deployments so that they can be applied ASAP. But still.
Recursion attacks against Regex are extremely well known. The only reason I can fathom for not having an execution time watchdog is that Nginx Lua runtime doesn't allow it. I assume the scripts run during a single request cycle on one thread due to Nginx async IO (one thread per core only).
That's still no excuse. They admit to running THOUSANDS of Regex rules in custom Lua scripts embedded in Nginx. This sounds like a bad idea to anyone that knows anything about software because it is.
My previous employer embedded way too much Lua script inside Nginx plugins for the same reasons (it's easy). Even at our "scale" (50 requests/second) we had constant issues. To think they run ~10% of internet traffic on such a Rube Goldberg machine is proof you can use just about anything in prod (until it inevitably explodes, at least).
I'm confused about what this response is trying to say. Did you read the whole post? They addressed exactly those two things and explained how they're fixing them. You're essentially just repeating part of the blog post, which is why I wonder if you finished reading it.
I'm interested in why they wouldn't use LPeg instead. LPeg patterns seem a lot easier to compose, reason about, and debug; plus they have restricted backtracking.
They still retain the global rollout for the other use cases detailed in the write-up, so it's generally tested, though not for this one use case as you point out. I suspect the tradeoff is reasonable, however having a short pre-stage deploy before global in all cases would be a more conservative option that would prevent an emergency push from becoming an even bigger emergency!
The outage was caused by a regex that ended up doing a lot of backtracking, turning it into a runaway expression for PCRE, the regex engine.
This reminded me of an HN post from a couple months back by the author of Google Code Search, and how it worked: https://swtch.com/~rsc/regexp/regexp4.html . Interestingly, he wrote his own regex engine, RE2, specifically because PCRE and others did not use real automata and he needed a way to do arbitrary regex search safely.
The problem is that a deterministic regex engine (deterministic finite automata or DFA) is strictly less powerful than a non-deterministic one (NFA). DFA's can't backtrack, for example. In addition, DFA's can be quite a bit slower for certain inputs and matches.
"You are technically correct. The best kind of correct."
In theory, your statement is perfectly correct. However, quoting that reference:
"However, if the NFA has n states, the resulting DFA may have up to 2^n states, an exponentially larger number, which sometimes makes the construction impractical for large NFAs."
This means that in practice, DFAs are larger, slower, and sometimes can't be run at all if complex enough.
However, this was my mistake. I remembered (vaguely) the 2^n issue and didn't follow up to make sure I was accurate.
And I completely spaced on the fact that neither NFA's nor DFA's handle backreferences without extension.
I don't know what you mean by "DFA's can't backtrack". Maybe you mean DFAs don't support backreferences, which is true, but NFAs don't support backreferences either.
I believe if r is the size of the regex and d is the size of the data, an NFA is O(r) to compile and O(rd) to execute, while a DFA is O(2^r) to compile and O(d) to execute. So DFAs are slower to compile, but faster to execute.
“Regular expression” has a different meaning in the programming context than in the formal language context. Regular expressions in regex libraries do more than match regular languages.
PCRE can also recognize all context-free languages and some subset of context-sensitive languages. Just having backreferences makes the problem NP-hard.
9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
Suggestion for future, learned from bitter experience: separate your control plane from your data plane. In this case, make sure that the tools you use to manage your infrastructure don't depend on that infrastructure being functional.
That way you won't have to remember how to use a bypass procedure -- it will just be your normal procedure.
Well, as with all things based in technical nuance, it depends on your definitions. Sure, control planes and data planes should be logically separated. But as you build and ship compelling products, your developers will gravitate to using well-built products’ (data plane) resources to build new products.
Imagine an IaaS cloud. It starts with Compute, Networking, Storage (block) and maybe Object Storage/S3. Next comes a fully-managed database product. The Database team may want to leverage the Object Storage data plane in the Database control plane. A year or two down the road, a team building a SaaS application will probably look to use the fully-managed database as it’s one less piece of infrastructure to manage.
To avoid or eliminate these types of delays in resolution, it’s imperative that the product team have a strong understanding of failure modes and dependencies. There’s a lot to be said for building completely isolated foundational services — it’s also a very expensive undertaking. Lastly, it’s possible to build out-of-band/break glass access without compromising security.
(I work at a global cloud but have no familiarity with CloudFlare’s internals.)
> The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression. More on that and what we're doing about it below.
We run a WAF based on LuaJIT in resty. Just to be clear, the resty interface to PCRE does provide a DFA mode. Furthermore, Zhang actually ported RE2 (see other comments here) to C as sregex, which is usable from Lua as a C module regardless of whether it runs in resty or a custom Lua app.
> Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
Not addressed at Cloudflare, since they had a defense in place. But just in case anyone else is running a similar thing in Lua.
And:
> In the longer term we are moving away from the Lua WAF that I wrote years ago.
Then sregex might be the perfect fit here. Though Rust is technically safer. Depends on what longer term means.
> Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving.
One of those cases where they had 1 problem, used a regular expression, and ended up with 2 problems?
Edit: I really like how much information is given by CloudFlare. 11 points in the "what went wrong analysis" is how every root-cause analysis should be done.
Somewhat humorous, as someone [1] (congrats /u/fossuser!) mentioned this failure scenario in the thread about Twitter being down yesterday.
"Pushing bad regex to production, chaos monkey code causing cascading network failure, etc.", in response to a comment from someone who previously worked at Cloudflare.
I agree this is an awesome post and a really great example of how every Root Cause Analysis needs to be done. I am also impressed by their incident response.
> Then we moved on to restoring the WAF functionality. Because of the sensitivity of the situation we performed both negative tests (asking ourselves “was it really that particular change that caused the problem?”) and positive tests (verifying the rollback worked) in a single city using a subset of traffic after removing our paying customers’ traffic from that location.
Haha, so the free customers are crash test dummies for providing test traffic. Nice.
I actually don't mind that much, considering it's basically bulletproof DDoS protection for free. I'd much rather "be the product" in this way than in the way ad companies cause at least.
I used to work as a network engineer for a while; now I do web development. I worked with a number of cloud providers, and you always have to roll out any fix carefully even if you're 100% sure (you're never 100% sure) that you've got the fix.
I honestly just assumed that when customers chose where they would try things outside their lab, it was lower-level customers, a less busy part of the network, anywhere the impact isn't as serious. That's where the lowest risk is.
Some customers would discuss their own customers by name, as in "Should we try this change on Customer Y?" And the discussion would work along those lines.
When I started deploying my own software, I just assumed anything that I was deploying to for free was a sort of "lab light" for them. I also don't mind, it seems fair.
ANY change outside a lab... is its own experiment.
Smaller customers don't have the same web traffic, which may not be enough to trip any given failure scenario. One could imagine that the backtracking in an onerous regexep is only triggered with a sufficiently large customer that has a path that is especially difficult to match.
With staged rollout and without a "fast" deploy procedure, by the time it hits the larger customers, it's already been deployed to some percentage of the fleet - and then you still have a problem, with a significant proportion of your fleet.
Staged rollouts are an entirely reasonable risk mitigation idea, mind you, and not one I'm even arguing against.
My point is that unfortunately it's no panacea, especially at scale. Which is what makes this all an experiment.
In this case yes, however they also indicate this is how they do their staged rollouts in general. So if they are releasing any other software update that goes through the staged rollout free customers are tested first. If that change broke something, free customers get that first. Which seems fair to me.
In my experience it’s generally best to roll out changes on testing, staging, and then clients in order of how much they pay, especially if you have SLAs with the highest paying customers.
Impact is generally lower, both to the client, and to your bank account.
That sounds strange to me. If you introduce a bug then roll back very quickly, it will only affect high paying customers. If you introduce a bug then roll back a while later, it will impact high paying and low paying customers equally. Why would you want this scenario? If you flip it it seems strictly better to me.
The idea is that the fix itself is being tested. If you knew your 'rollback' would work for certain, then you'd just deploy it to everyone asap. But since you don't, you test it, and since a potential outcome of your test is no fix, or making things worse, you don't test it on your highest-value customers. Imagine what your postmortem would read like if your fix made an even bigger mess.
I'm just describing what I think the sequence in the postmortem is. They were already in the poop and wanted to test their fix in a real but low-impact way.
Overall I think it's a good deal for both users and Cloudflare. Users get a major CDN for free, and instead of paying for it with ads, surveillance or other shady thing, they pay by being beta testers.
> I'd much rather "be the product" in this way than in the way ad companies cause at least
Your customers are the product. Cloudflare sets a first party tracking cookie on every domain they serve. They unwrap TLS and can see every product your customers look at or buy.
Whether intentionally or not, they built the Ad Network 2.0. They found the solution to ISPs not being able to snoop, and browsers locking down third party tracking.
That blow-up is quadratic, not exponential: "This is not classic catastrophic backtracking (performance is O(n²), not exponential, in length), but it was enough."
Always appreciate the transparency from you and Cloudflare. :)
My main fright during this outage wasn't really the outage itself, but the fact that I couldn't log into the dashboard and simply click the orange cloud to bypass Cloudflare in the meantime. I'm assuming that this is now covered by this mitigation:
>> 6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare's edge.
If so, and if this would have prevented the dashboard outage even during the WAF fiasco, this is a huge comfort to me. Just curious, though: how far can you really go in separating Cloudflare "the interface" from Cloudflare "the network?"
And in general, what does everyone on HN think about mission-critical companies using their own infrastructure and being their own customer? Especially when the alternative is using a competitor?
I've always been a proponent of separating the monitoring from the infra. Otherwise your insight is binary: the service is either up or down. You don't have any context as to why.
Edit: Additionally, from a competitive standpoint, I don't see a problem with using a third-party platform for a monitoring service.
Yes, this is always one of the primary questions we ask when deciding when and how we should dogfood our own services: will we create a circular dependency where our ability to fix an issue on one service is hindered by any chain of dependencies between the service with the issue and the service used to fix it? We always avoid those, or at least have easy alternatives.
Absolutely agree about an external monitoring service being a necessity. I was more referring to cloudflare.com (and specifically dash.cloudflare.com) being entirely served through Cloudflare itself, or the AWS console being hosted on AWS, etc.
>> 6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare's edge.
> If so, and if this would have prevented the dashboard outage even during the WAF fiasco
It wouldn't prevent the initial dashboard outage. However, in a similar situation where the main issue can't be resolved quickly, it would allow them to restore dashboard access.
> In the last few years we have seen a dramatic increase in vulnerabilities in common applications. This has happened due to the increased availability of software testing tools, like fuzzing for example (we just posted a new blog on fuzzing here).
So security/debugging tools increased the number of [discovered/exploited] vulnerabilities, because developers don't use them. Only malware developers and third-party security researchers take the time to test security.
Yup, unless you're a seriously security- or stability-focused company, you don't even use basic stuff like static analysis, let alone fuzzing. Tools for these are often expensive, hard to use, or both.
Awesome write-up as usual, John. I'm no expert so I was wondering: "In the initial moments of the outage there was speculation it was an attack of some type we’d never seen before." - Is there a reason you would go to this first vs. checking the most recent code deploys first and/or at the same time?
There really should be a prize or something for the best postmortems, to reward companies for doing this and give them some PR.
Most of us will (hopefully) never be in a situation like this so "book knowledge" of extremis events is the next best thing available. And that relies on good write-ups.
1. It appears there was a safe path with more safety and scrutiny, and a fast path with less. In this case, over time, the fast path became routine. Are there other places where this pattern could develop or has already developed? Is this tradeoff between speed and scrutiny actually necessary? (ie could you have urgent updates reach production faster but actually receive more scrutiny/more testing, even if that happens after the fact?)
2. In a similar vein, if the system has a failsafe configuration (eg only changes that have passed the full barnyard, or configurations that have been running safely for more than a certain amount of time), would it be plausible to automatically roll servers back to that configuration if they remain unresponsive for a certain amount of time?
3. It seems as though there are multiple points (big WAF refactor, credential expiry, internal services dependent on working prod) where a sufficiently cynical engineer would say "I bet there's something here that could, if not bring down the site, at least ruin someone's day". Is there a suitable voice for this kind of cynicism? Eg, a red team or similar? If you were Murphy's Law incarnate, messing with Cloudflare's systems to achieve maximum mischief, where would you start?
4. I get the sense that there are many reliable and well-tested layers of safety, but is it common to test what happens if they fail anyway? Eg: let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do? Or let's say our staged rollout system gets completely bypassed because of solar flares, how bad is it? Beyond developing a procedure or training for these kinds of situations, are they actively simulated or practiced?
If anything, I'd guess the root root cause here is a success failure, where the system has been so reliable for so long that the main reactions to it failing are disbelief and unpreparedness. I'm sure it wasn't funny at the time, but it gives me a chuckle to imagine the SREs speculating about Mossad quantum-tunnelling 0days or something because the idea of everything falling over on its own is so unthinkable. Meanwhile, those of us without so many 9s would jump straight to "I probably broke it again."
Overall I think these are very thoughtful, and I upvoted your comment.
However, I don't think this question is very fruitful:
> let's pretend Cloudflare just got knocked out globally by a wizard spell, what do we do?
The way you solve a production issue is you identify its cause and then contain, mitigate, or fix it. I don't think you'd learn anything useful from a drill where there's no specific cause.
Perhaps along similar lines to what you're thinking of, something I could see being useful is to look at components that you've already thought to implement a 'global kill' for, like WAF, for instance. Maybe you could run drills where every machine running WAF starts blackholing packets, or maxing out RAM, or (as happened here) maxing out CPU, the kind of thing where you'd want to execute the 'global kill' in the first place. That way, you can ensure that the 'global kill' switches are actually useful in practice. Something like that seems more grounded to me, making the assumption that something specific is going wrong and not just "magic", while still avoiding too-specific assumptions about what can and can't go wrong.
Maybe I'm not the intended audience, but I had to look up the acronym WAF since it came up a lot in this article. I'm assuming it's "web application firewall"?
Also interested in this protection. How does it detect this situation and what does it do when something is detected? Did you notice when it was accidentally removed that your monitoring of this condition went to zeros (or was it never happening during normal operations)?
The last time they had a global problem, everyone scrambled for more than a week. (Cloudbleed)
This 30-minute global outage was pretty nasty, but not anywhere near as awful. Timing helped, as nothing truly critical was affected. (There are some extremely high-volume sporting events which, if affected even just for a few minutes, can have a direct impact on the bottom line.)
I do not wish to see more of these. Cloudbleed gave me two weeks of headache and an indigestion problem. This one did basically nothing. If there is a happy middle ground between the two, I am not exactly thrilled at finding out what it is.
Agree. Cloudbleed was really, really awful. We should write up all the things we learned from that and all the changes we've made since. Just looking at the number of engineers who are Rust experts since then, for instance.
Might be late, but has anyone in CloudFlare tried to switch away from regex to something more efficient and powerful?
Tools like re2c can convert 100s of regexes and CFGs into a single optimized state machine (which involves no backtracking, as far as I remember). It should easily handle 10s of millions of transactions per second per core if the complete state machine fits into the CPU's level 3 cache (or lower), with a bit of optimization.
There is also Ragel [0], but I think that in this context deploying regexes as strings is safer than generating code and deploying that code (unless Ragel could generate webassembly).
Ragel has the advantage that CPU blowups happen at compile time, rather than run-time. Other risks aside, they would have avoided this problem had they been using ragel or something similar to pre-compile their patterns into deterministic machines.
The article says they're going to switch to either RE2 or Rust's regex, both of which use a DFA (a state machine) and have no backtracking.
But you do bring up a good point. RE2 and Rust both compile the regex in the same process that executes it. Compiling the regex as part of your build process then pushing the compiled form could have advantages.
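As a rough sketch of that idea (assuming nothing about Cloudflare's actual pipeline, and using Python's backtracking `re` as a stand-in for PCRE): a build/CI step could at least compile every rule pattern and time it against adversarial probes before anything ships, failing the build instead of the edge. Actually shipping a precompiled automaton would need an engine that supports it, like RE2 or re2c.

    import re
    import time

    RULE_PATTERNS = [
        r"<script[^>]*>",
        r"union\s+select",
        r".*(?:.*=.*)",   # shaped like the pattern in the outage; should trip the check
    ]
    PROBES = ["x" * 10_000, "=" * 10_000, "<" * 10_000]   # adversarial-ish inputs
    BUDGET = 0.010   # seconds per anchored match attempt

    def check_rules() -> None:
        for pattern in RULE_PATTERNS:
            compiled = re.compile(pattern)   # syntax errors fail the build here
            for probe in PROBES:
                start = time.monotonic()
                compiled.match(probe)        # anchored attempt keeps the demo quick
                elapsed = time.monotonic() - start
                if elapsed > BUDGET:
                    raise SystemExit(f"rule {pattern!r} took {elapsed:.3f}s "
                                     f"on a {len(probe)}-byte probe; refusing to deploy")

    if __name__ == "__main__":
        check_rules()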
It's meant to match any number of any characters, then match an equal sign, then match any number of any characters. But it's very badly written. It should instead simply be written
.*=.*
BTW, your comment got mangled by HN's markdown formatting.
I could tell you but I also want to tell you to plug that bad boy into https://regex101.com/ . It will give you a written explanation of the regexp on the right. And now, would you have believed I knew without peeking? ;)
They said it was for XSS detection. I think the purpose was to identify reflected XSS by looking for paths or headers containing JavaScript-esque variable assignment (JS keywords/syntax preceding "something=something"), but not 100% sure.
The appendix does a pretty good job, but the TL;DR is: it will match any string containing 1 or more equal signs.
Slightly more verbosely, it will match [0-or-more bytes of anything] followed by [0-or-more bytes of anything] followed by [an equal sign] followed by [0-or-more bytes of anything]. The expensive part is that it can't decide where the first grouping of [0-or-more bytes of anything] ends and the second grouping begins. It doesn't matter where the division is, of course, but many regex engines use an exponential-time algorithm for that, even though an obvious linear-time algorithm exists (and pre-dates the exponential-time algorithm!).
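If you want to see the quadratic growth for yourself, here's a quick stand-in experiment using Python's backtracking `re` engine (not PCRE, the engine in the outage, but it behaves analogously on this pattern):

    import re
    import time

    pattern = re.compile(r".*.*=.*")

    for n in (4_000, 8_000, 16_000, 32_000):
        subject = "x" * n                 # worst case: no '=' anywhere
        start = time.monotonic()
        pattern.match(subject)            # single anchored attempt, no match found
        print(f"n={n:>6}  {time.monotonic() - start:.3f}s")
    # The time roughly quadruples each time n doubles: O(n^2), as the post says.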
> A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
Taking the safeties off to go faster... yes, you will go faster, but it might be right off a cliff.
This is a good lesson on Chesterton’s Fence. I’ve been thinking for a while that we really need the (default behavior) ability to annotate commits after the fact, so that we have a durable commentary that can evolve over time. We should be able to go back and add strongly worded things like “yes this looks broken but it exists due to this bug fix” or “please don’t write new code that looks like this. See xyz for a better alternative.”
Hell I think I’d be perfectly ok if the code review lived with the code permanently. Regression in the code? Josh warned you it was a bad idea. Maybe we should listen to Josh more?
There’s an art to that, I’ve seen new code get between the comment and the code, and since the comment is in a separate commit, it’s difficult to go back ten refactorings later to answer why. The most interesting bug fixes I do end up exploiting the commit history. Yes it’s hidden in plain sight, but it’s also more reliable.
These days we treat code as a living breathing thing. No reason we can’t do the same to commits.
I used to be really into regex and I'm now rusty, but wouldn't the desired representation of .*.*=.* be something closer to [^=]*[^=]*=[^=]* ?
I feel like it could be optimized further but this would be the first step, and wouldn't most experienced regex authors use that from the beginning, nipping the whole backtracking problem in the bud and making the regex much more performant?
The original one would match "==", your suggestion would not. To get a clean regex they should switch to
.*=.*
I don't think they should spend any time contemplating whether a regex will backtrack, because it's hard. Instead they should (and are planning to) simply switch to a better regex library that never backtracks.
Well, if I still worked on Hyperscan, this would be my "what am I, a potted plant?" moment. I think Cloudflare is pretty determined to avoid x86-only implementations of anything, though.
It's entertaining to see people making the same mistakes that have been widely known about in network security well before there was Hyperscan, RE2, etc.
One thing set off alarm bells in my head from an operational perspective:
> Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
That's a short timescale for quite a significant change. I know it's just replacing a piece of automation with one that does the same task, but the guts are all changing, and all automation introduces some level of instability and a bunch of unknowns. Changing the regex engine is just as significant as introducing new automation from an operations perspective, even if it seems like it should be a no-brainer. I'd encourage taking time there (unless this is something they've been working on a lot and are already canary testing).
The other steps look excellent, and they should all collectively give ample breathing room to make sure that switching to re2 or Rust's regex engine won't introduce further issues. There's no need to be doing it on a scale of weeks.
Some quick thoughts about Quicksilver: Deploying everywhere super fast is inherently dangerous (for some reason, old school rocketjumping springs to mind. Fine until you get it wrong).
I definitely see the value for customer actions, but for WAF rule rollouts, some kind of (automated) increasing speed rollouts might be good, and might help catch issues even as the deployment steps beyond the bounds of PIG etc. canary fleets. Of course, that's also useless in and of itself unless there is some kind of automated feedback mechanism to retard, stop, or undo changes.
If I can make a reading suggestion: https://smile.amazon.com/gp/product/0804759464/ref=ppx_yo_dt... The book is "High Reliability Management: Operating on the Edge (High Reliability and Crisis Management)" (unfortunately not available in electronic form). It's focussed on the energy grid in California, the authors were university researchers specialising in high reliability operations, and they had the good fortune to be present doing a research job at the operations centre right when the California brownouts were occurring in the early 2000s. There's a lot to be gleaned from that book, particularly when it comes to automation, and especially changes to automation.
RE2 is bulletproof. It is the de facto RE engine used by anyone looking to deploy regexes to the wild (it was used in the now-defunct Google Code Search, for example). It has a track record. Russ Cox, its author, was affiliated with Ken Thompson from very early in his career.
Rust also has a good pedigree for not being faulty. BurntSushi, the author of rust's regex crate also has a good pedigree...
We switched to RE2 for a massive project 2 years ago and haven't looked back. It is a massive improvement in peace of mind.
If anything, I'm surprised that JGC has allowed the use of PCRE in production and on live inputs...
I'm absolutely not denying that RE2 is great. Not in the slightest. I even agree with their idea to switch towards it or the Rust one.
Changing anything brings an element of risk, and changing quickly to it, even more so, which is essentially what they're proposing doing. That's where my concern lies.
Their current approach clearly has issues, but it has been running in production for several years now, and those issues are fully understood, engineers know how to debug them, and there's a lot of institutional knowledge covering them. They've put a series of protective measures in place following the incident that take out one of the more significant risks. That gives them breathing space to evaluate and verify their options, carry out smaller-scale experiments, train up engineers across the company on any relevant changes, etc. There is no reason to go _fast_.
This is such a great retrospective - thanks very much for this, jgrahamc. I really appreciated learning so much about how Cloudflare works internally, and the appendix (with animations!) was delightful.
> It might also be obvious that once state 4 was reached (after x= was matched) the regular expression had matched and the algorithm could terminate without considering the final x at all.
That's true if you just want a boolean result. But if you want to get the matched string (which it appears the actual code does), then you need to continue, because it's using greedy matching.
Great post! Can’t think of many companies that would spend this much time explaining regex.
I was affected by this outage, but I really appreciate Cloudflare taking the time to explain the problem in this much detail. Given their own systems were affected, I’m surprised they mitigated as fast as they did.
I'm a relatively novice regex user. Could anyone explain to me why someone might use an expression like `.*(?:.*=.*)`? What is the meaning of the group if its boundary could be placed in any number of places? Hope that makes sense.
It can be useful depending on how the engine handles the match. The non-capturing group is the important part, the .* is just there so that the 'full match' isn't just an empty string, but contains the whole line the rule is being run against
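A quick illustration of that point, using Python's `re` as a stand-in:

    import re

    line = "GET /page?foo=bar&x=1"
    m = re.search(r".*(?:.*=.*)", line)
    print(m.group(0))    # the whole line: the full match spans everything
    print(m.groups())    # () -- the (?: ) group adds no capture groups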
Really appreciate the detail and operational insights in this post-mortem. Great work. Also appreciate the heads up and detailed explanation of the issues with backtracking in regexes. Turns out we use those expensive patterns in a few places.
What about the fact that WAF CPU usage wasn't isolated from the ability to serve requests? Isolating it would allow requests that don't go through the WAF to proceed as usual.
I believe WAF is a feature customers enable, not all customers have it enabled. So some customers are already open, and in theory wouldn't need to be affected by a WAF outage.
I love how Cloudflare turns an outage into a selling opportunity by explaining exactly why you need their product to a hungry audience waiting to read about it.
Cloudflare lets their customers write their own WAF regex rules right?
And those rules still get run on every box on cloudflares edge network with HTTP requests from strangers on the internet right?
So how come this didn't get triggered by a customer first?
Perhaps it did get triggered by a customer first, but that customer didn't get much traffic on the URL that triggers the issue, and that box got one thread stuck executing that regex for a few minutes till a health check killed it...? Does this imply that cloudflare runs with random failing health checks across the fleet and there isn't someone looking at core dumps of such failures?
That would align with my experience with seeing occasional "502 bad gateway" errors from cloudflare over the past few years. It also seems likely considering the incident where cloudflare servers leaked sensitive memory contents into HTTP responses which happened so frequently they got cached by google search. Hard to leak arbitrary memory contents without occasional SIGSEGV's...
If the above conjecture is true, it reflects very badly on engineering culture at Cloudflare. The core issue had been seen across the fleet sporadically for a long time, but was ignored, and even during the postmortem process, which should be a very thorough investigation, the telltale pre-warning signs of the issue were still missed.
They allow a limited subset of rules, with strict parameters of what logic is allowed. Unless you do something fancy with workers.
Also, the protection for this was only removed in a recent update before the incident, so a customer doing this wouldn't have had an impact until that protection was removed. So maybe a few weeks earlier they might have started seeing some problems. But again, I am pretty sure the logic in the rule that caused the issue isn't available to customers.
1. An engineer wrote a regular expression that could easily backtrack enormously.
2. A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.
3. The regular expression engine being used didn’t have complexity guarantees.
4. The test suite didn’t have a way of identifying excessive CPU consumption.
5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
6. The rollback plan required running the complete WAF build twice taking too long.
7. The first alert for the global traffic drop took too long to fire.
8. We didn’t update our status page quickly enough.
9. We had difficulty accessing our own systems because of the outage and the bypass procedure wasn’t well trained on.
10. SREs had lost access to some systems because their credentials had been timed out for security reasons.
11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.
Here's my version of what went wrong:
1. The process for composing complex regular expressions is "engineer tries to shove a lot of symbols into a line" rather than "compile/compose regex programmatically from individual matches"
2. Production services had no service health watchdog (the kind of thing that makes systemd stop re-running services that repeatedly hang/die)
3. Performance testing/quality assurance not done before releasing changes (this is not CI/CD)
4. No gradual rollout
5. No testing of rollbacks
6. Lack of emergency response plans / training
All of these things are completely common, by the way, so they're in no way surprising. Budget has to actually be set aside to continuously improve the reliability of a service, or it doesn't get done. These incidents are a good way to get that budget.
(Wrt the regexes, I know they're implementing a new system that avoids a lot of it, but in the new system they can still write regexes which (I think) should be constructed programmatically)
I don't see the relevance of how regexes are written to the problem they had. The engineer didn't typo the regex, or have a hard time understanding what it would match.
Instead, they didn't understand the runtime performance of the regex, as it was implemented in their particular system. No amount of syntax can change that.
A framework that allows well-written, "normal" code to parse out what you want, can produce something easier to understand and maintain, surfacing this type of bug in a more obvious way.
Cryptic syntax is the main reason I avoid regexes (particularly complex ones).
Too much obfuscation between the code you write and the steps your program will take. Granted, my concern doesn't apply to master craftsmen who truly understand the nuances of the tool, but in the real world those are few and far between.
ps. I get there was a lot more going on in this postmortem than just one rogue regex.
By writing regexes by hand, you can accidentally introduce an obviously backtracking pattern such as .*.*=.* . By programmatically composing them, a program can analyze each regex group to find simple problems, and then combine them in ways that will avoid backtracking.
This isn't even why you should compose them programmatically, though. Perl allows you to compose a regex with in-line comments (https://perldoc.perl.org/perlfaq6.html#How-can-I-hope-to-use...), but it's still a hand-crafted regex, which is error-prone, much like composing code by hand. If you can get a machine to generate it for you, you avoid unintentional human-introduced bugs, as well as make it easier to read and reason about.
If you have a ton of regexes, or they are super important to your business, you should consider not editing them by hand. There's only so much test cases can do to prevent bugs.
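As a toy sketch of what programmatic composition could look like (Python, with fragment names invented for illustration): each fragment can be unit-tested on its own, and the final pattern is assembled rather than hand-typed.

    import re

    # Individually testable fragments (names made up for this sketch).
    FRAGMENTS = {
        "js_keyword": r"(?:var|let|const|function|return)",
        "identifier": r"[A-Za-z_$][\w$]*",
        "spaces":     r"\s*",
    }

    def build_assignment_probe() -> re.Pattern:
        # Assemble "keyword identifier =" without any hand-typed .* in sight.
        parts = [
            FRAGMENTS["js_keyword"], r"\s+",
            FRAGMENTS["identifier"], FRAGMENTS["spaces"], "=",
        ]
        return re.compile("".join(parts), re.IGNORECASE)

    RULE = build_assignment_probe()
    print(bool(RULE.search("?q=var x = alert(1)")))      # True
    print(bool(RULE.search("a perfectly benign path")))  # False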
So in response to a catastrophic failure due to testing in prod, they're going to push out a brand new regex engine with an ETA of 2 weeks. Can anyone say testing in prod?
The constant use of 'I' and 'me' (19 occurrences in total) deeply tarnishes this report, and repeatedly singling out a responsible engineer, nameless or not, is a failure in its own right. This was a collective failure, any individual identity is totally irrelevant. We're not looking for an account of your superman-like heroism, sprinting from meeting rooms or otherwise, we want to know whether anything has been learned in the 2 years since Cloudflare leaked heap all across the Internet without noticing, and the answer to that seems fantastically clear.
This report is written by me, the CTO of Cloudflare. I say "I" throughout because organizational failings are my responsibility. If I'd said "we" I imagine you'd be criticizing me for NOT taking responsibility.
If you read the report you'd see I do not blame the engineer responsible at all. Not once. I made that perfectly clear.
I wonder if you are able to talk a bit about the development of the Lua-based WAF. I imagine the possible unbounded performance of feeding requests into PCRE must have occurred to you or others at the time - or at least, long before this outage.
I don't mean this as some sort of lame 'lol shoulda known better' dunk - stories about technical organizations' decision-making and tradeoff-handling are just more interesting than the details of how regexes typed in a control panel grow up to become Jira tickets.
It sounds like one of the primary factors was compatibility with existing (or customer-provided) mod_security rules, if I've understood 1.75x speed hyper-you right.
Wow, I'm amazed two people could read that writeup (yourself and myself) and come to two totally different conclusions.
Pushing out a brand new regex engine surely will go through the usual process. This doesn't seem like it will take a lot of time unless there are surprises. Cloudflare clearly already has the infrastructure in place to do a proper integration test for correctness, and ramp-up infrastructure to ensure it doesn't cause a global outage. The global nature of this outage happened because the ramp-up infrastructure was explicitly not used, as per the protocol.
I have no idea what you read where a single engineer was singled out. At several points in this post mortem the author identifies that the regex being written by the individual involved was far from the only cause of the outage. This is a very textbook blameless post mortem doc afaict.
The narrative about the actions taken and the meetings people were in is also par for the course for a good post mortem, since these variables are real and should be addressed by remediation items if they contributed to the outage. (For example, is it sane that the entire engineering team was synchronously in a meeting? Probably not.)
It seems we're reading different blog posts. Under the "What went wrong" section there are 11 points, all with differing levels of responsibility and ownership. He did well to identify the collective nature of this failure.
I don't see why switching to a new regex implementation would be so scary. 2 weeks to test that your regexes don't break seems fine? Seems like a long time tbh.
On top of that they're switching to more constrained regex engines. Rust's regex engine makes guarantees about its running time, something that would have directly mitigated a portion of the issue. And it isn't as if RE2/Rust regex aren't in use anywhere, rust's regex engine is integrated into vscode, for example.
That comment was breaking the site guidelines, quite badly in fact. We moderate comments like that the same way regardless of who or what they're about.
> there are literally 10-50 of these per day in arbitrary threads
If you can find cases of this where moderators didn't respond, I'd like to see links. The likeliest explanation is simply that we didn't see it. We don't come close to seeing everything that gets posted here, so we depend on users, via flagging (https://news.ycombinator.com/newsfaq.html) or by emailing hn@ycombinator.com.
> What is HN running on again?
I suppose I have to answer this or someone will concoct a sinister reason why I didn't. HN doesn't run on Cloudflare.
You can easily duplicate traffic into a test infrastructure that wouldn't affect the production environment, and you're acting as if re2 et al. haven't had plenty of testing too. 2 weeks with the level of traffic (test data) that Cloudflare gets seems pretty realistic.