The spectacularly GOOD news is that they f—in’ found it and fixed it in 23 minutes!!!
If you’re counting, it was actually 27 minutes to get things rolling again, and 58 minutes to full recovery. Still, that’s pretty reasonable.
https://blog.cloudflare.com/cloudflare-outage-on-july-17-202…
Per the post, a configuration change to a router in Atlanta started the outage at 21:12. Once the outage was understood, the Atlanta router was disabled and traffic began flowing normally again at 21:39.
Shortly after, they saw congestion at one of their core data centers that processes logs and metrics, causing some logs to be dropped. During this period the edge network continued to operate normally.
20:25: Loss of backbone link between EWR and ORD
20:25: Backbone between ATL and IAD is congesting
21:12 to 21:39: ATL attracted traffic from across the backbone
21:39 to 21:47: ATL dropped from the backbone, service restored
21:47 to 22:10: Core congestion caused some logs to drop, edge continues operating
22:10: Full recovery, including logs and metrics
What is bad, however, is that this is not the first time an admin made a mistake that escaped into the field: https://blog.cloudflare.com/details-of-the-cloudflare-outage…
That blog post says they were making the following changes to their procedures:
1. Re-introduce the excessive CPU usage protection that got removed. (Done)
2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
3. Introduce performance profiling for all rules to the test suite. (ETA: July 19)
4. Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare’s edge.
7. Automating update of the Cloudflare Status page.
I look at #5 and wonder why a staged rollout wasn’t performed in the current instance, and at #1 and wonder why it didn’t help in this case (which also resulted in excessive CPU utilization). In both cases, a code review might have caught the problem, so I stand by my assertion that Cloudflare still hasn’t put proper development and test procedures in place. Code reviews are so basic that it’s astonishing to me they don’t have them.
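For anyone who hasn’t seen catastrophic regex backtracking in action, here’s a quick Python sketch of the failure mode behind the 2019 outage. The pattern below is the classic textbook example, not the actual WAF rule, and the time budget is a number I made up, but it shows why #1, #3, and #4 matter: in a backtracking engine, match time on a pathological pattern explodes with input length, and a simple per-rule time budget in the test suite would flag it.

import re
import time

# Classic pathological pattern -- NOT the actual Cloudflare WAF rule, just an
# illustration: the nested quantifier in (a+)+ forces a backtracking engine to
# try exponentially many ways to split the 'a's once the trailing 'b' makes the
# match fail.
PATTERN = re.compile(r"^(a+)+$")

def match_time(n):
    # Time one failed match against n 'a' characters followed by a 'b'.
    subject = "a" * n + "b"
    start = time.perf_counter()
    PATTERN.match(subject)
    return time.perf_counter() - start

# The kind of per-rule performance budget item #3 implies (0.1s is an
# arbitrary illustrative threshold).
BUDGET_SECONDS = 0.1

for n in (10, 14, 18, 22):
    elapsed = match_time(n)
    verdict = "OK" if elapsed < BUDGET_SECONDS else "FAIL - blows the budget"
    print(f"n={n:2d}  {elapsed:.4f}s  {verdict}")

Engines like re2 or Rust’s regex crate compile patterns to automata that match in time linear in the input, so this failure mode can’t occur by construction; that’s the “run-time guarantees” #4 refers to.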
On a related note, it was interesting to read that Cloudflare’s previous outage last year was first brought to their attention via PagerDuty, which is not a stock followed here. A similar service, DataDog, is followed here. While the two services overlap, there are also things each does better. Here’s a post from PagerDuty on integrating with DataDog: https://www.pagerduty.com/blog/datadog-integration-best-prac…
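For anyone unfamiliar with what PagerDuty actually does, here’s a rough sketch of the kind of event a monitoring system fires at their Events API v2 to page whoever is on call. The routing key and alert details are placeholders I made up; a DataDog-to-PagerDuty integration like the one described in that post sends an equivalent event automatically when a monitor trips.

import json
import urllib.request

# Minimal sketch of opening an incident through PagerDuty's Events API v2.
# The routing key and alert details are placeholders, not anything from
# Cloudflare's or DataDog's actual setup.
EVENTS_API = "https://events.pagerduty.com/v2/enqueue"

def trigger_alert(routing_key, summary, source):
    event = {
        "routing_key": routing_key,   # integration key for the PagerDuty service
        "event_action": "trigger",    # "trigger" opens an incident; "resolve" closes it
        "payload": {
            "summary": summary,       # the first thing the on-call engineer sees
            "source": source,         # which host or monitor raised the alert
            "severity": "critical",
        },
    }
    req = urllib.request.Request(
        EVENTS_API,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call a hypothetical CPU-exhaustion monitor might make:
# trigger_alert("YOUR_INTEGRATION_KEY", "Edge CPU exhausted across PoPs", "cpu-monitor")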