The spectacularly GOOD news is that they f—in’ found it and fixed it in 23 minutes!!!
If you’re counting, it was actually 27 minutes to get things rolling again, and 58 minutes to full recovery. Still, that’s pretty reasonable.
https://blog.cloudflare.com/cloudflare-outage-on-july-17-202…
Per the post, a configuration change to a router in Atlanta started the outage at 21:12. Once the outage was understood, the Atlanta router was disabled and traffic began flowing normally again at 21:39.
Shortly after, they saw congestion at one of their core data centers that processes logs and metrics, causing some logs to be dropped. During this period the edge network continued to operate normally.
20:25: Loss of backbone link between EWR and ORD
20:25: Backbone between ATL and IAD is congesting
21:12 to 21:39: ATL attracted traffic from across the backbone
21:39 to 21:47: ATL dropped from the backbone, service restored
21:47 to 22:10: Core congestion caused some logs to drop, edge continues operating
22:10: Full recovery, including logs and metrics
What is bad, however, is that this is not the first time an admin made a mistake that escaped into the field: https://blog.cloudflare.com/details-of-the-cloudflare-outage…
That blog post says they were making the following changes to their procedures:
1. Re-introduce the excessive CPU usage protection that got removed. (Done)
2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
3. Introduce performance profiling for all rules to the test suite. (ETA: July 19)
4. Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare’s edge.
7. Automating update of the Cloudflare Status page.
I look at #5 and wonder why a staged rollout wasn’t performed in the current instance, and at #1 and wonder why it didn’t help in this case (which also resulted in excessive CPU utilization). In both cases, a code review might have caught the problem, so I stand by my assertion that Cloudflare still hasn’t put proper development and test procedures in place. Code reviews are so basic that it’s astonishing to me they don’t have them.
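For anyone who hasn’t seen catastrophic regex backtracking in action, here’s a quick Python sketch of the failure mode behind the 2019 outage. The pattern below is the classic textbook example, not the actual WAF rule, and the time budget is a number I made up, but it shows why #1, #3, and #4 matter: in a backtracking engine, match time on a pathological pattern explodes with input length, and a simple per-rule time budget in the test suite would flag it.

import re
import time

# Classic pathological pattern -- NOT the actual Cloudflare WAF rule, just an
# illustration: the nested quantifier in (a+)+ forces a backtracking engine to
# try exponentially many ways to split the 'a's once the trailing 'b' makes the
# match fail.
PATTERN = re.compile(r"^(a+)+$")

def match_time(n):
    # Time one failed match against n 'a' characters followed by a 'b'.
    subject = "a" * n + "b"
    start = time.perf_counter()
    PATTERN.match(subject)
    return time.perf_counter() - start

# The kind of per-rule performance budget item #3 implies (0.1s is an
# arbitrary illustrative threshold).
BUDGET_SECONDS = 0.1

for n in (10, 14, 18, 22):
    elapsed = match_time(n)
    verdict = "OK" if elapsed < BUDGET_SECONDS else "FAIL - blows the budget"
    print(f"n={n:2d}  {elapsed:.4f}s  {verdict}")

Engines like re2 or Rust’s regex crate compile patterns to automata that match in time linear in the input, so this failure mode can’t occur by construction; that’s the “run-time guarantees” #4 refers to.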
On a related note, it was interesting to read that Cloudflare’s previous outage last year was first brought to their attention via PagerDuty, which is not a stock followed here. A similar service, DataDog, is followed here. While the two services overlap, there are also things each does better. Here’s a post from PagerDuty on integrating with DataDog: https://www.pagerduty.com/blog/datadog-integration-best-prac…
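For anyone unfamiliar with what PagerDuty actually does, here’s a rough sketch of the kind of event a monitoring system fires at their Events API v2 to page whoever is on call. The routing key and alert details are placeholders I made up; a DataDog-to-PagerDuty integration like the one described in that post sends an equivalent event automatically when a monitor trips.

import json
import urllib.request

# Minimal sketch of opening an incident through PagerDuty's Events API v2.
# The routing key and alert details are placeholders, not anything from
# Cloudflare's or DataDog's actual setup.
EVENTS_API = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(routing_key, summary, source):
    event = {
        "routing_key": routing_key,   # integration key for the PagerDuty service
        "event_action": "trigger",    # "trigger" opens an incident; "resolve" closes it
        "payload": {
            "summary": summary,       # the first thing the on-call engineer sees
            "source": source,         # which host or monitor raised the alert
            "severity": "critical",
        },
    }
    req = urllib.request.Request(
        EVENTS_API,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call a hypothetical CPU-exhaustion monitor might make:
# trigger_alert("YOUR_INTEGRATION_KEY", "Edge CPU exhausted across PoPs", "cpu-monitor")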