Not a surprise to me, obviously, as I’ve been posting that they’d have to if they want to survive.
But, my summary of what happened differs slightly from what @stenlis posted, in that there are implied problems not directly stated in Crowdstrike’s latest information release.
TL;DR: Crowdstrike is obfuscating (gaslighting, in my view) what wasn’t tested by highlighting a bunch of testing they did months earlier on different releases. But, they didn’t do ~95% of that testing on the July 19th release of two configuration files! As I said up-thread, that’s because they considered data files “safe.” And since there was another bug in their skimpy “content validator,” the little testing they did do was flawed.
__
To be clear, Crowdstrike only tacitly admits they did NOT perform integration testing, performance testing, or stress testing of the July 19th release, and that they bypassed (as they have in the past) their own staged rollout process that customers may have configured, including not rolling out the update internally first. They literally rolled this update out to customers before running it internally!
So, what I said merely “appears” to be the case in my earlier post days ago is now confirmed by Crowdstrike as a tragedy of compounding errors. Worse, I find Crowdstrike’s current description of the testing they do on other types of releases to be gaslighting, as if they’re trying to say “we’re not total dummies, we do know what testing is.” But, the bottom line is they didn’t do the testing they claim to know how to do.
__
OK, here are some more details. Here’s the key part of Crowdstrike in essence saying we’re not total dummies:
The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.
You may recall this follows what I’ve posted earlier. What Crowdstrike then describes has to be evaluated in terms of what they don’t say is tested for this release.
The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is only delivered with the release of an updated Falcon sensor.
Yeah, so what Crowdstrike just did is tell you how Sensor Content is tested, but Sensor Content wasn’t the problem! So why tell us? Maybe they hope people will stop reading by this point and just come away with “this was complicated,” but it isn’t complicated at all.
I’m also omitting additional details from the blog about testing that Crowdstrike did perform in Feb and Mar, since all this prior testing is also irrelevant to what was released on July 19.
They then describe “Rapid Response Content,” which, as I said days ago, is not code but a configuration data file. And then, by way of omission, we see what happened:
On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
But, Content Validation is NOT a full suite of tests. These “IPC Template Instances” were not, to use Crowdstrike’s own words, put through “unit testing, integration testing, performance testing and stress testing… culminat[ing] in a staged sensor rollout process …”
This is again confirmed with:
Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.
So, oh sure: we tested stuff back in March, so when we added two new files that passed a quickie content-validation check, we decided all was good and rolled them out to production without integration testing, without performance testing, without stress testing, and while bypassing customers’ staged rollout processes.
And then we get to the actual bug, which has existed in their code for months (exactly how long, we don’t know, since Crowdstrike hasn’t said):
When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).
That is inaccurately put. Crowdstrike is admitting that they did not check for an out-of-bounds memory access, and so they did nothing to prevent the exception from occurring. Checking that memory pointers are within reasonable limits (i.e., not NULL and not pointing past the end of the buffer being read) should be a standard part of any such code, especially code inside a boot-time kernel driver. That “unexpected exception” should not have been unexpected; it should have been anticipated, caught, and gracefully handled. This is pointer math 101, folks. Students have been graded on exactly this in undergraduate programming courses for half a century. It isn’t new, it isn’t hard, and it’s expected of even junior programmers.
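For readers who don’t write C, here’s roughly what that kind of defensive check looks like. This is a minimal sketch under assumed names (the content_file struct and read_u32_checked function are hypothetical, not CrowdStrike’s actual driver code): verify the pointer and the offset against the buffer’s real length before reading, and return an error instead of letting the hardware fault.

```c
/* Minimal sketch of a bounds-checked read; all names here are
 * hypothetical and not taken from CrowdStrike's driver. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *data;  /* start of the loaded content file */
    size_t         len;   /* number of bytes actually loaded  */
} content_file;

/* Read a 32-bit little-endian field at `offset`, refusing any access
 * that would fall outside the buffer instead of dereferencing blindly. */
static int read_u32_checked(const content_file *f, size_t offset, uint32_t *out)
{
    if (f == NULL || f->data == NULL || out == NULL)
        return -1;                                   /* reject NULL pointers    */
    if (offset > f->len || f->len - offset < sizeof(uint32_t))
        return -1;                                   /* would read past the end */

    /* The whole field lies inside the buffer, so this is safe. */
    *out = (uint32_t)f->data[offset]
         | (uint32_t)f->data[offset + 1] << 8
         | (uint32_t)f->data[offset + 2] << 16
         | (uint32_t)f->data[offset + 3] << 24;
    return 0;
}
```

The entire fix is those two if statements: when they fail, the driver can log the problem and skip the bad file, keeping the machine running instead of blue-screening it.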
__
Their fixes involve:
- Full testing of all releases, even if they’re just data files.
- Fixing the Content Validator to actually validate content (see the sketch after this list).
- Putting memory pointer checks into the kernel driver code so any bad data that does get through is handled gracefully.
- Obeying the staged rollout process customers thought they had in place.
- Adding new staged rollout controls.
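To make the “actually validate content” item concrete, here’s a minimal sketch of what a stricter validator could do, reusing the hypothetical content_file struct and read_u32_checked reader from above. The file layout (a count field followed by fixed-size entries) is invented for illustration; the point is that the validator walks the file with the same checked reads the driver would use, so a malformed file fails on the build server instead of in a customer’s kernel.

```c
/* Hypothetical content-validator sketch; the layout and names are invented.
 * Assumes the content_file struct and read_u32_checked() shown earlier. */
static int validate_content_file(const content_file *f)
{
    uint32_t declared_count;
    size_t   offset = 0;

    /* Invented layout: the file begins with a count of fixed-size entries. */
    if (read_u32_checked(f, offset, &declared_count) != 0)
        return -1;
    offset += sizeof(uint32_t);

    /* Walk every declared entry with the same checked reads the driver
     * would use; a truncated or garbage file fails here, before release. */
    for (uint32_t i = 0; i < declared_count; i++) {
        uint32_t entry;
        if (read_u32_checked(f, offset, &entry) != 0)
            return -1;
        offset += sizeof(uint32_t);
    }
    return 0;
}
```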
Ironically, Crowdstrike had to add its own Channel File 291 to its “known-bad list” of security exploits.
__
To summarize the tragedy of errors:
- A bad data file was generated.
- Pre-existing Crowdstrike code did not check for valid memory pointers, a basic defensive programming technique that even junior programmers know to include.
- The only check performed on this data file was the Content Validator, which itself has bugs.
- Integration, performance, and stress testing were not performed.
- The update was not run internally before releasing to customers.
- The update bypassed any staged rollout customers had set up.
The incompetence here, especially for a supposedly world-class security company, is astounding. “Move fast and break things” might be OK for social media pages, but not for corporate security at this level. It makes one wonder what other parts of Crowdstrike are also incompetently programmed.