Outage caused by Crowdstrike update takes down computers around the world

I don’t think they will disclose any more details. Since liability for damages is a possibility, whatever they disclose may be used against them. Their best bet is to just move on and refer any further questions to the statement they have already issued.

But this is just an educated guess. Maybe they’ll surprise me.

5 Likes

I think Crowdstrike lucked out: Biden not running has driven “Crowdstrike” off the front pages and given the public some time to forget, to some extent, about what happened.

42 Likes

I completely disagree. At the very least, Crowdstrike will tell existing customers why they don’t need to worry about things like this moving forward, and I expect any such communication, even if private through a login support portal, will get out - and Crowdstrike needs to be prepared for that eventuality. And if Crowdstrike doesn’t tell potential customers what they’re doing/have done to prevent this kind of thing from happening again, Crowdstrike will not only not gain new business, they’ll lose existing customers.

I’m sure many lawyers are scrutinizing the Crowdstrike license agreements (and it’s not unlikely that some big customers had already negotiated different agreements, btw) to see what legal recourse they have for lost revenue due to what is likely going to be described as “gross negligence” on Crowdstrike’s part.

24 Likes

I agree with you and @stenlis. What I mean by that is yes, Crowdstrike needs to communicate with their customers in order to provide assurances that this is a one-off and that measures have already been instituted to ensure there are no repeat performances.

However, their lawyers need to take a very hard look at every word that goes on the public record prior to release. Crowdstrike’s exposure is potentially large enough to drive them into bankruptcy. I don’t think I’m exaggerating or overstating the situation.

At this time there’s no way of knowing with any certainty how much financial injury this outage has caused, but it most certainly is enormous and insurance may not cover it.

15 Likes

I would hate to see the version where they were unlucky. :face_with_peeking_eye:

5 Likes

It appears Crowdstrike compounded the problem. According to a comment in this video, Crowdstrike ignored their own staging policy feature. Here’s the comment:

CS Falcon has a way to control the staging of updates across your environment. businesses who don’t want to go out of business have a N-1 or greater staging policy and only test systems get the latest updates immediately. My work for example has a test group at N staging, a small group of noncritical systems at N-1, and the rest of our computers at N-2.

This broken update IGNORED our staging policies and went to ALL machine at the same time. CS informed us after our business was brought down that this is by design and some updates bypass policies.

So in the end, CS caused untold millions of dollars in damages not just because they pushed a bad update, but because they pushed an update that ignored their customers’ staging policies which would have prevented this type of widespread damage. Unbelievable.

This is also claimed by a separate commenter on ycombinator:

What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pis_sed over everyone’s staging and rules and just pushed this to production.

From: CrowdStrike says flawed update was live for 78 minutes | Cybersecurity Dive

CrowdStrike’s Falcon platform runs on cloud-native architecture and a set of automated tools and processes in the CI/CD pipeline, according to the company’s product descriptions. This includes components for automated testing and a staging environment for quality assurance and A/B testing before the application is deployed into production.

The company has not explained how the defective update, which triggered a logic error resulting in a system crash, made it into the hands of customers. CrowdStrike did not respond to a request for comment.

And here there is some defense of bypassing staging:

Crowdstrike pushed an “unskippable” update to all of their phone-home endpoints. Anyone set with an N-1 or N-2 configuration (where N represents the most recent version of the software, and the -# is how many versions behind someone chooses to be) had that option ignored.

This is logical for this product in some sense. A 0-day fix needs to be propagated immediately. Being N-1 on a 0-day is not wise.

Everyone believed that CrowdStrike was doing its due diligence in staging before pushing it out to the rest of the world. Obviously, someone in CrowdStrike skipped a step.
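
To make the N / N-1 / N-2 idea described above concrete, here is a minimal sketch (invented policy structure and logic, not CrowdStrike’s code) of what an endpoint-side staging check amounts to, and how a vendor-side “push to everyone” flag defeats it:

```c
/* Hypothetical sketch of an N / N-1 / N-2 staging check on an endpoint.
 * Names and fields are invented for illustration; this is NOT CrowdStrike code. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int latest_version;    /* newest published version ("N")                    */
    int staging_offset;    /* 0 = take N, 1 = stay at N-1, 2 = stay at N-2, ... */
} staging_policy_t;

/* Should this endpoint accept the offered version? */
static bool should_install(const staging_policy_t *p, int offered_version,
                           bool vendor_bypass)
{
    if (vendor_bypass)          /* the reported failure mode: policy ignored */
        return true;

    /* Normal case: only install versions at or below the chosen staging level. */
    return offered_version <= p->latest_version - p->staging_offset;
}

int main(void)
{
    staging_policy_t prod = { .latest_version = 100, .staging_offset = 2 };

    printf("normal push of v100:   %d\n", should_install(&prod, 100, false)); /* 0: held back */
    printf("bypassed push of v100: %d\n", should_install(&prod, 100, true));  /* 1: installed */
    return 0;
}
```

The whole value of an N-1 or N-2 setting rests on the vendor honoring that check; a content channel that skips it skips the protection.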

So, it appears to me to be a tragedy of compounding errors:

  1. The driver code itself did not check for NULL pointers, bad addresses, etc.
  2. The driver code was not properly tested against bad inputs, nor did it have sufficient error handling.
  3. The updated definition file was not internally tested before release.
  4. The update of the definition/data file bypassed Crowdstrike’s own staging policies and so went to all machines at the same time.

Lastly, CrowdStrike Founder and CEO George Kurtz stated that the company will provide full transparency on how this occurred and the steps to prevent anything like this from happening again.

Yeah, as I’ve been saying, they’ve got to come clean, otherwise the pain will last longer. Like ripping the band-aid off, it’ll be better than sweeping the root cause and remediation efforts under a rug. But, who’s going to buy/add Crowdstrike right now? Maybe at fire-sale prices…

30 Likes

Thanks Smorg for the interesting details. As an organization gets bigger, coordination and rules enforcement become socially more challenging. Kind of like herding a thousand cats. A big problem is folks jumping off the groomed switchback and trying to blaze up the trail via a “shortcut” by going straight up or down. Now Kurtz has been requested to testify before Congress next week. Not good.

BTW Bert Hochfeld and others have posted recommendations for SentinelOne as the one most likely to benefit. IMO users of SASE and EDR products will not make a sudden dash to a different product. Endpoint software installation is very sticky and procedural. CRWD contracts are usually long term and provide a platform of security applications, not just EDR. But we will likely see some impact to new CRWD sales going forward.

-zane

24 Likes

Surprisingly, Crowdstrike did post additional details:

It’s very wordy and for the most part describes how their Falcon sensor works.
The description of how the failure landed on customer systems is disappointing. Basically they run the files through what they call a “Content Validator” and if it says the files are fine, they roll them straight onto customer machines. How could this Content Validator say a file full of zeros was fine? They address that with one line:

Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

That’s it.

In the last section they describe what measures they intend to implement to prevent this from happening again:

Improve Rapid Response Content testing by using testing types such as:

  • Local developer testing
  • Content update and rollback testing
  • Stress testing, fuzzing and fault injection
  • Stability testing
  • Content interface testing
  • Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

Jeez, these are some really basic measures one would think the world’s security leader already had in place. Stability testing? You mean like running the application at least once on a Windows machine to see if it doesn’t crash? Like the stuff students do in high school CS projects before submitting their work?

IDK, it seems worse to me than before. I assumed they had tested the file, or at least run it ONCE, and that it got corrupted afterwards. Their description shows it was broken from the very start and was not tested in any way whatsoever.
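
For readers not steeped in QA jargon, “fuzzing” in that list just means feeding the parser huge numbers of random or malformed inputs and watching for crashes. Here is a minimal sketch of what such a harness looks like (libFuzzer-style; the toy file format and function are invented, not CrowdStrike’s):

```c
/* Minimal sketch of a libFuzzer-style harness for a content-file parser.
 * Everything here is invented for illustration; it is not CrowdStrike's
 * format or code.
 * Build: clang -g -fsanitize=fuzzer,address fuzz_parser.c */
#include <stddef.h>
#include <stdint.h>

/* Toy "channel file": a 4-byte little-endian record count, then 8-byte records. */
static int parse_channel_file(const uint8_t *buf, size_t len)
{
    if (len < 4)
        return -1;                       /* too short to hold the header */

    uint32_t count = (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
                     (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;

    /* Bounds check: the declared record count must fit in the buffer.
     * Omitting a check like this is exactly the class of bug fuzzing finds. */
    if (count > (len - 4) / 8)
        return -1;

    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; i++)
        sum += buf[4 + (size_t)i * 8];   /* touch each record */
    return (int)(sum & 0x7fffffff);
}

/* libFuzzer entry point: called millions of times with mutated inputs. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    parse_channel_file(data, size);
    return 0;                            /* return value is ignored */
}
```

A harness like this tends to surface crashes on malformed input within minutes of runtime, which is why skipping it (and everything else) for Rapid Response Content is so striking.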

I wish them luck, but oh boy, I would not recommend Crowdstrike to my employer.

13 Likes

Not a surprise to me, obviously, as I’ve been posting they have to if they want to survive.

But, my summary of what happened is slightly different from what @stenlis posted, in that there are implied problems not directly stated in Crowdstrike’s latest information release.

TL;DR: Crowdstrike is obfuscating (gaslighting, in my view) what wasn’t tested by highlighting a bunch of testing they did months earlier on different releases. But they didn’t do ~95% of that testing on the July 19th release of two configuration files! As I said up-thread, that’s because they considered data files “safe.” And since there was another bug in their skimpy “content validator,” what little testing they did was flawed.

__

To be clear, Crowdstrike only tacitly admits that they did NOT perform integration, performance, or stress testing of the July 19th release, and that they bypassed (as they have in the past) their own staged rollout process that customers may have configured, including not rolling out the update internally first. They literally rolled this update out to customers before running it internally!

So, what I said “appears” to have happened in my earlier post days ago is now confirmed by Crowdstrike as a tragedy of compounding errors. Worse, I find Crowdstrike’s current description of the testing they do on other types of releases to be gaslighting, as if they’re trying to say “we’re not total dummies, we do know what testing is.” But the bottom line is they didn’t do the testing they claim to know how to do.

__

OK, here are some more details. Here’s the key part of Crowdstrike in essence saying we’re not total dummies:

The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.

You may recall this follows what I’ve posted earlier. What Crowdstrike then describes has to be evaluated in terms of what they don’t say is tested for this release.

The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is only delivered with the release of an updated Falcon sensor.

Yeah, so what Crowdstrike just did is tell you how Sensor Content is tested, but Sensor Content wasn’t the problem! So why tell us? Maybe they hope people will stop reading by this point and just come away with “this was complicated,” but it isn’t complicated at all.

I’m also omitting additional details from the blog about testing that Crowdstrike did perform in Feb and Mar, since all this prior testing is also irrelevant to what was released on July 19.

They then describe “Rapid Response Content,” which, like I said days ago, is not code but a configuration data file. And then by way of omission we see what happened:

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

But, Content Validation is NOT a full suite of tests. These “IPC Template Instances” were not, to use Crowdstrike’s own words, put through “unit testing, integration testing, performance testing and stress testing… culminat[ing] in a staged sensor rollout process …”

This is again confirmed with:

Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

So, oh sure, we tested stuff back in March, so when we added two new files that passed a quickie content validation test, we decided all was good and rolled them out to production without integration testing, without performance testing, without stress testing, and bypassing customers’ staged rollout process.

And then we get to the actual bug, which existed in their code for months (we don’t know how long since Crowdstrike hasn’t said):

When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).

That is an inaccurate way to put it. Crowdstrike is admitting to not checking for an out-of-bounds memory pointer, and so they didn’t prevent the exception from occurring. Checking that memory pointers are within reasonable limits (i.e., not NULL and inside the buffer) should be a standard part of any such code, especially code inside a boot-time kernel driver. That “unexpected exception” should not have been unexpected and should have been caught and gracefully handled. This is pointer math 101, folks. Kids are graded on stuff like this in undergraduate programming, and have been for half a century. This isn’t new or hard or beyond even junior programmers.
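
For the non-programmers following along, here is roughly what that defensive check looks like in practice. This is a deliberately simplified, user-mode C sketch with an invented file layout and invented names (it is not CrowdStrike’s driver code), but it shows the “validate every count and offset against the buffer before using it, and reject the file instead of crashing” pattern:

```c
/* Illustrative only: invented file layout and names, ordinary user-mode C.
 * The point is that every count and offset read from the content file is
 * checked against the real buffer size BEFORE it is used, and a malformed
 * file is rejected so the sensor keeps running instead of crashing the OS. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t offset;   /* byte offset of a rule blob inside the file */
    uint32_t length;   /* length of that blob                        */
} rule_entry_t;

/* Returns 0 if the file is usable, -1 if it must be rejected. */
static int load_rules(const uint8_t *file, size_t file_len)
{
    uint32_t n_rules;

    if (file == NULL || file_len < sizeof n_rules)
        return -1;                                  /* NULL or truncated */
    memcpy(&n_rules, file, sizeof n_rules);

    /* The entry table must fit inside the buffer (division avoids overflow). */
    if (n_rules > (file_len - sizeof n_rules) / sizeof(rule_entry_t))
        return -1;                                  /* count out of range */

    for (uint32_t i = 0; i < n_rules; i++) {
        rule_entry_t e;
        memcpy(&e, file + sizeof n_rules + (size_t)i * sizeof e, sizeof e);
        if (e.offset > file_len || e.length > file_len - e.offset)
            return -1;                              /* entry points past end */
        /* ...safe to read file[e.offset .. e.offset + e.length) here... */
    }
    return 0;
}

int main(void)
{
    uint8_t bogus[16] = { 0xff, 0xff, 0xff, 0xff };  /* claims ~4 billion rules */
    printf("bogus file rejected: %s\n",
           load_rules(bogus, sizeof bogus) != 0 ? "yes" : "no");
    return 0;
}
```

This is belt-and-suspenders: the content file should never have shipped broken, but a kernel-resident component still has to assume any input can be garbage and fail closed without taking the OS down.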

__

Their fixes involve:

  1. Full testing of all releases, even if they’re just data files.

  2. Fix the Content Validator to actually validate content.

  3. Put memory pointer checks into the kernel driver code so any bad data that does get through is gracefully handled.

  4. Obey the staged rollout process customers thought they had in place.

  5. Add new staged rollout controls (a rough sketch of a staggered/canary ramp follows below).
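
On item 5, for anyone curious what staged rollout controls look like mechanically, here is a hypothetical sketch (invented names and mechanism, not CrowdStrike’s) of a vendor-side staggered/canary ramp: each endpoint is hashed into a stable bucket, and the update is only offered once the current wave percentage covers that bucket.

```c
/* Rough sketch of a staged / canary ramp on the vendor side.  Invented for
 * illustration; not CrowdStrike's mechanism. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a hash: stable, so an endpoint always lands in the same bucket. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* true if this endpoint is inside the current rollout wave (0-100 percent). */
static bool in_rollout_wave(const char *endpoint_id, unsigned percent)
{
    return fnv1a(endpoint_id) % 100u < percent;
}

int main(void)
{
    const char *hosts[] = { "pos-terminal-0017", "gate-display-zrh-12",
                            "icu-monitor-044",   "laptop-finance-9" };
    unsigned waves[] = { 1, 10, 50, 100 };   /* canary, then widening waves */

    for (size_t w = 0; w < 4; w++) {
        printf("wave at %3u%%:", waves[w]);
        for (size_t i = 0; i < 4; i++)
            if (in_rollout_wave(hosts[i], waves[w]))
                printf(" %s", hosts[i]);
        printf("\n");
    }
    return 0;
}
```

Had something like this been in front of the July 19 content push, the blast radius would likely have been a small canary wave, not millions of machines.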

Ironically, Crowdstrike had to add its own Channel File 291 to its “known-bad list” of security exploits.

__

To summarize the tragedy of errors:

  1. A bad data file was generated.
  2. Pre-existing Crowdstrike code did not check for valid memory pointers, a basic defensive programming technique that even junior programmers know to include.
  3. The only check performed on this data file was the Content Validator, which itself has bugs.
  4. Integration, Performance, and Stress testing were not performed.
  5. The update was not run internally before releasing to customers.
  6. The update bypassed any staged rollout customers had set up.

The incompetence here, especially for a supposedly world-class security company, is astounding. “Move fast and break things” might be OK for social media pages, but not for corporate security at this level. It makes one wonder what other parts of Crowdstrike are also incompetently programmed.

34 Likes

Looking at families still trying to get onto Delta flights 4 days after this debacle, I see major class action suits coming. I can see a real possibility of bankruptcy for CrowdStrike.

The class action suit will be something like this. “How much you got? Hand it over!”

Cheers
Qazulight

8 Likes

I concur with the gist of Smorg’s post above. Tho there are a lot of assumptions about the technical specifics (including that the config file was full of nulls, which CrowdStrike stated early on wasn’t the issue, and whether the code did or did not have null pointer checks) that are bogging down the convo.

Let’s boil it all down for the laymen.

It was essentially a config change (templates) that was pushed through a separate deploy channel from agent updates. They use this separate deploy channel to continuously push out new combinations of sensor checks for newly discovered attack vectors. It also seems this separate config deploy circumvents any customer staging setups used to control rollouts (which apply only to agent updates).

The initial new templates passed fuller QA checks back in March, but a subsequent config template on July 19 had a problem that slipped past one preliminary checker and then triggered a latent bug in the agent. Whoops.

  • Problem 1: Subsequent config updates after the last agent update in March did not go through a full QA cycle that would have caught this. Mgmt is saying every update will be fully checked going forward.
  • Problem 2: Config updates skipped customers’ staging (rollout) policies. Mgmt said that config updates will adhere to the staging policies set up for agent updates.

Early on in the incident, the Cloudflare CEO correctly surmised on Twitter that it was a config change rather than an app update, as they too have had some major outages due to config changes slipping past QA checks. The difference here is that (unlike a cloud service) Windows kernel agents cannot roll back a faulty update once Windows goes into failure mode.

The biggest issue IMHO isn’t how it was caused, it’s the effect – and what the ramifications of that are.

Critical systems globally failed and were not easily recovered. No automation scripts could be used to back out the faulty update – customers had to go to each and every affected machine and correct it via a manual process (typically booting into safe mode or the recovery environment and deleting the faulty channel file). And if they used proper encryption techniques like Windows BitLocker, that manual process was even more cumbersome, since it also required the recovery key for each machine.

CrowdStrike has caused its own customers a LOT of pain. There was a huge amount of lost productivity (employee systems down) but especially lost business (POS systems down, airline/rail outages, 911 failures, hospital equipment monitoring), which cascaded into those customers having their own set of angry customers. Then add in the huge stress on IT departments having to manually correct this on each and every Windows server (cloud or on-prem) and on employee workstations and laptops.

It is likely that critical servers are now back up, but it’s going to take weeks for some global/large orgs to recover what might be tens to hundreds of thousands of Windows machines. Every monitor in airports, every remote POS across a national retail chain, every distributed IoT sensor array for a national pipeline or other remote field operations.

Folks who stress that this isn’t a security issue are ignoring the brand damage that has occurred: CrowdStrike did damage that Russian hackers could only dream about.

I’m not sure what the liability will be and how it will trickle up through customers and on up the chain – that will take years in the courts to determine.

But the brand damage is immense, and I think the impacts more immediate. At a minimum, they have to dish out massive discounts to keep existing large customers happy [which will only slightly start to compensate them for the IT overtime alone, but be nowhere near the overall business impact]. Lands are going to dry up, and expands might too as folks question the risks of security platformization onto a single vendor.

-muji

50 Likes

Maybe an order of magnitude more than that, even now. Microsoft reported that 8.5 MILLION computers were affected, and Crowdstrike itself has only said “a significant number” are back online, which to me implies less than half.

I wonder if the effects will be worse than that. This is not just an unfortunate combination of errors, it’s a bunch of rookie software mistakes that call into question the basic competency of Crowdstrike’s software engineering.

Adopting Crowdstrike almost certainly requires approval from your company’s C-Suite technical people: the CSO (Chief Security Officer), CIO (Chief Information Officer), or CTO (Chief Technology Officer), for instance. Those people are going to look at the underlying causes (and that’s plural) of this event and wonder what potential issues lurk in the other 21 of Crowdstrike’s modules. We’re seeing issues across a variety of functions (development, testing, deployment), so it’s likely that these kinds of problems exist in many of the other modules Crowdstrike has developed.

And perhaps even worse, this lack of competency, spanning rookie-level defensive coding, basic internal testing, and the bypassing of features advertised to customers who depended on them, could mean that Crowdstrike’s system is itself vulnerable to attack. It seems unlikely to me that any C-Suite officer would approve continuing to use Crowdstrike until the company can demonstrate that all 22 of its modules are appropriately hardened, tested, and obey staging procedures.

That said, switching vendors isn’t easy or quick. It’s not like using Zoom instead of WebEx. But until/unless Crowdstrike cleans house and tells us about reviews of all 22 modules, the code analysis they have performed, and the testing procedures across the entire product line, this is going to be bad for their business.

31 Likes

They are not handling the PR side very well. Today this juicy story has blown up all over social media: CrowdStrike offers a $10 apology gift card to say sorry for outage | TechCrunch

And the worst (and funniest) part:

On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”

It appears they might have used a multi-use code, and they had to cancel it after it was shared publicly. I can almost hear the Curb Your Enthusiasm theme music playing. Larry David couldn’t have scripted a more cringe-worthy series of missteps if he tried.

10 Likes

This is a phishing scam. Don’t click on any links if you get this email.

5 Likes

There might be phishing scams too now, trying to benefit from all the confusion. But CrowdStrike actually confirmed they sent out these gift cards. :man_facepalming:

From the TechCrunch article:

CrowdStrike spokesperson Kevin Benacci confirmed to TechCrunch that the company sent the gift cards.
“We did send these to our teammates and partners who have been helping customers through this situation. Uber flagged it as fraud because of high usage rates,” Benacci said in an email.

Bloomberg (and others) reported on it too:

A CrowdStrike spokesperson, Kirsten Speas, confirmed that the credits were sent to “teammates and partners who have been helping customers through this situation” but said they didn’t go out to “customers or clients.” Speas didn’t elaborate on who received the gift cards or how many were sent.
The gesture, which was previously reported by TechCrunch, was greeted with some scorn on social media

https://www.bloomberg.com/news/articles/2024-07-24/crowdstrike-sends-10-gift-cards-to-it-workers-as-mea-culpa

6 Likes

@CMF_muji thanks for the explanation and your analysis. I take exception to only one thing: “It was essentially a config change (templates) that was pushed through a separate deploy channel from agent updates.”
To be honest, I really don’t think that flies for most laymen. I have 30 years in IT as my background. Many of those years were spent in application development and testing. I am not confident that I fully understand that comment.

7 Likes

I’m not Muji, but perhaps I can add some useful color. Crowdstrike has multiple kinds of updates. Two broad categories would be:

  1. An update of the actual executables (Falcon Sensor Updates).
  2. An update of configuration files used by those executables (Content Configuration Updates).

The recent issue was with a bad configuration file (the Channel 291 file) in a Content Configuration Update. That bad file triggered a “logic error” that existed (and, unless they’ve done a Sensor Update, still exists) in their executable code, which causes systems to crash when starting/restarting.

I believe what Muji means by “separate deploy channel” is that the update processes, including validation, testing, and roll-out scheme, are different for Sensor Updates than for Content Configuration Updates.

Furthermore, there were different processes for the two kinds of Content Configuration Updates:

Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed.

The issue on Friday involved a Rapid Response Content update with an undetected error.

This Rapid Response Content Update was validated only with a “Content Validator” program, according to Crowdstrike. This update was not even loaded onto one sample machine inside of Crowdstrike before being sent out, and yet it was sent to all relevant customer machines, even if the customer had specified some machines to only get delayed updates (Canary or Staged Rollout parameters being bypassed). Apparently, Crowdstrike viewed these Rapid Response Content Updates as very safe.
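
To make the “separate deploy channels with different gates” idea concrete for the lay reader, here is a toy sketch (invented names and logic, not CrowdStrike’s actual pipeline) of what it means for two kinds of updates to be gated differently before release:

```c
/* Purely illustrative sketch of "separate deploy channels with different
 * gates".  Names are invented; this is not CrowdStrike's pipeline. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    SENSOR_UPDATE,           /* new executable code (the Falcon sensor itself) */
    RAPID_RESPONSE_CONTENT   /* new detection data (e.g. a channel file)       */
} update_kind_t;

/* Decide which gates an update must pass before it reaches customers. */
static bool release_allowed(update_kind_t kind,
                            bool passed_full_test_suite,   /* unit/integration/perf/stress */
                            bool passed_content_validator, /* quick schema-style check     */
                            bool honored_customer_staging) /* N / N-1 / N-2 policies       */
{
    switch (kind) {
    case SENSOR_UPDATE:
        /* Code updates: full test suite, dogfooding, then staged rollout. */
        return passed_full_test_suite && honored_customer_staging;

    case RAPID_RESPONSE_CONTENT:
        /* As described in the post-incident write-up, this channel only
         * required the Content Validator and ignored staging policies. */
        return passed_content_validator;
    }
    return false;
}

int main(void)
{
    /* The July 19 case: the validator (buggily) said yes, and nothing else gated it. */
    printf("rapid-response release allowed: %d\n",
           release_allowed(RAPID_RESPONSE_CONTENT, false, true, false));
    return 0;
}
```

As described in Crowdstrike’s own write-up, the lightly gated path also ignored customer staging, so one bad content file reached everything at once.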

Crowdstrike now vows to be more careful in the future: to perform not just more QA on all updates, but to add better development processes (including reviews by third parties), and to have all updates obey customer-specified staged rollout parameters.

That’s good news, but these are standard things that any large software company, especially one that supplies kernel drivers and handles system security, should have been doing since day one, particularly since Crowdstrike’s CEO was a C-level exec at McAfee when it suffered a similar crash caused by a bad data file.

This Barron’s article discusses that a bit:

CrowdStrike’s current CEO George Kurtz was the chief technology officer of McAfee in 2010 when that company put out an eerily similar faulty security update that crashed tens of thousands of computers around the world and required a tedious manual fix of deleting a system file.

When asked about Kurtz’s history at McAfee, a CrowdStrike spokesperson said in a statement, “George was there as a sales-facing CTO, not in charge of engineering, technology, or operations.”

That misses the point. Kurtz had seen firsthand the impact a bad data file could have, and even if the McAfee bug wasn’t his fault, he should have taken steps so his new company’s customers would not suffer the same fate. But, he didn’t. Will customers forgive him? Will investors? Will you?

28 Likes

Smorg, thanks for the explanation. After CRWD dropped a bunch I thought it was on sale, so I sold my CELH, which I had been looking for an excuse to exit, and bought CRWD. But after I thought about it more carefully, I decided that buying it was rash and I sold the next day - at a loss. I don’t know if they will recover. There was a lot of financial damage as a result of this blunder, which should never have happened. If I were the CTO/CIO of a firm that was badly impacted, I would most certainly be working with my legal folks to figure out what to do.

10 Likes

CRWD was my largest position at 25%. I had trimmed twice as it got over 20% previously. That’s the history.

After consideration, I ended up selling half of my position (all shares in IRA).

But the position is still too large. I’ll be selling at least half of my shares with tax consequences. The remaining will be roughly 5%.

I actually do believe CRWD will come back from this but I expect it to be a long slog.

I’ll rotate some of the cash to S which is only a 3% position. I do believe S will benefit from this.

AJ

9 Likes

CRWD was my largest position, at around 10%, before the drop. S may benefit short term, but they have their own growth-related issues and, as a whole, an inferior product line.

These have been a very frustrating two weeks, but I don’t plan on taking a tax hit. While there was a lot of noise, especially due to the airlines, a lot of CrowdStrike customers have recovered smoothly, and I believe most damage will be covered by the insurance companies and not go to litigation.

While there will be some churn at the end of some current customer service agreements and some loss of new business this quarter, I believe CrowdStrike can recover within the next 6-12 months. If you look at the market, you actually don’t have that many good options; you have Microsoft and SentinelOne, but if CIOs wanted to go with them, they could have done that already, since CrowdStrike is not the cheapest option. My strategy is to wait and see how things develop.

10 Likes