The Economics and Mechanics of Ransomware

A prior thread on the ransomware attack at UnitedHealth pointed out a few practical operational considerations of large systems that may not be understood by non-technical outsiders. The designs of many systems used by large organizations make them particularly brittle in the event of these types of attacks. Having spent considerable time on this topic in Corporate America, I have attempted to summarize these characteristics so people can better understand them as customers of these organizations, as potential investors and as general citizens.

---------------------------------------

Recent stories about a ransomware attack on UnitedHealth and its extended impact on United's daily operations for nearly a month generated conversation online and (likely) in many boardrooms about the real possibility of harm from ransomware and the readiness of organizations to recover from such attacks. Such concerns are very appropriate. I suspect the directions provided by leaders in such organizations to prepare for such attacks are NOT wholly appropriate, because many IT and security experts within those organizations and their leaders likely lack a clear understanding of the design characteristics of their current systems and the operational impacts of those designs, especially in the scenario where all data has been purposely destroyed or made unavailable.


A Key Distinction

The key distinction to make in the realm of harmful software is the intent of the originator. Software termed malware is designed to harm its target in some way and generally includes no capability to "un-do" its damage. The creators' motivation might be spite, like a bunch of teenagers sitting in an online chatroom, or the creator might be an economic or political entity attempting to seriously harm an opponent. That harm could stem from irreversibly scrambling data or from using existing software for unintended purposes (like spinning centrifuges at twice their rated speed to destroy them and prevent refinement of uranium). If data is the only object damaged in a malware attack, the ability of the target to recover the data depends upon the competency of the malware creators. If they knew what they were doing and truly randomly scrambled or simply overwrote the data, there's no practical way to recover using the data in place. Backups are the only avenue of recovery.

Software termed ransomware isn't designed to permanently harm a target (though permanent harm CAN result). It is instead a tool used as part of a business model for extorting cash from victims. Creators of ransomware want to make money. It's impossible for them to make money if their ransomware mechanism proves to be significantly less than 100% reversible. If a target is attacked with a ransomware tool that has scrambled 50 other large companies and only five were able to recover using the creator's recovery process, few other firms will pay the ransom and the creator's business model will collapse. If they continue anyway, their tool ceases being ransomware and has the same effect as any other malware.


Attack Architecture

Malware and ransomware ("hackware") are very similar in one key area. Both adopt a similar "architecture" in the layers of software used, because both require the appearance of an INSTANT attack to achieve their goal and avoid being disabled. Of course, when attacking a large corporation with 50,000 employees, 50,000 laptops and two data centers with 10,000 servers running internal and external e-commerce systems, it is IMPOSSIBLE to literally attack all 60,000 machines simultaneously. Most hackware is designed in layers that, for purposes of explanation, will be termed shim, full client and command & control.

The shim layer is the piece of software that exploits software the target is already running to make that software do something it wasn't INTENDED to do but is PERMITTED to do. Ideally, this additional action LOOKS like regular activity that "machine zero" might perform, to avoid triggering alerts about an unexpected process running or an unexpected attempt to reach some other remote resource. Note that the software targeted by the shim is NOT necessarily the ultimate target of the hackware. That initial point of infection is only the weak link being exploited, because the hackware creators learned how to corrupt it to do something else useful in their attack and the target company happens to run that software. In the SolarWinds attack of late 2020, data managed by SolarWinds within a target company was NOT the actual target of the attack. It was just a widely used piece of enterprise software with a vulnerability the hackers learned to exploit.

The exploit leveraged by the "shim" layer may not allow a large enough change in the software being corrupted to perform the real action to be invoked by the hackware. The shim may instead target OTHER installed software or install NEW software to actually implement the real bad action to be performed at the time of the eventual attack. That software is the real "client" of the attack. Since most PCs and servers run anti-virus software looking for unexpected binaries or new processes, the client layer of most hackware relies upon being able to masquerade as something already allowed or upon being able to interfere with those scanning processes and discard their alerts. The key concept to understand at this point in the narrative is that the time of initial infection (by the shim) or "full infection" (by the client) is NOT the time of the attack. The process of infecting dozens / hundreds / thousands of machines while evading security monitoring tools takes time. Not just hours or days. Weeks. Months. (This has huge cost impacts on mitigation strategies to be explained later.)

Since full infection can take an extended period yet the goal of the hackware is to appear to attack simultaneously, most large scale hackware attacks leverage an external "command and control" layer which performs multiple tasks. It tracks "pings" from each infected machine to trace the progress of the original "shim" infection or the "full client" infection. In many cases, the hackware creators aren't targeting a particular organization in advance; they are learning who they infected via this telemetry and deciding if they want to ALLOW the attack. Since this telemetry can disclose public IP addresses of the infected machines, those addresses can help the hackware creators confirm the size of the target and decide how long to wait for additional infections before triggering the actual attack onset. For example, if a PING comes from IP 201.14.92.52 and that is part of a block operated by Joe's Bait & Tackle, the originators may just skip him. If the block is operated by Gitwell Regional Hospital in Podunk, AR that operates 90 beds, they might wait for another 40 or 50 machines to PING before triggering the attack. If the block belongs to Ford Motor Company and only 4000 machines have PINGed in, they may wait until they see 50,000 to 60,000 before pulling the trigger.

The process of "pulling the trigger" is also designed in a way to avoid detection. Obviously, a firm whose security software sees 60,000 laptops all continuously polling some IP address in Russia is likely to detect that and get a heads up that trouble is looming. Instead, the "full client" running on each infected machine may be written to "poll" for instructions on a random interval over DAYS to get the final green light date and time of attack. Since most laptops and servers in Corporate America use NTP (Network Time Protocol) to sync onboard clocks down to the millisecond, once thousands of infected systems learn the attack date and time, they all just wait for that time to arrive and do not have to sync with the mother ship or each other to yield a simultaneous onset of the attack. Included with the green light and attack date/time will be a cryptographic key each client should use to generate an internal key to encrypt the data. If the attacker actually does honor any ransom payment, the command and control system will also signal the "paid" status the clients can use to reverse the encryption.


Ransomware Recovery

As mentioned before, options for recovery from a malware attack are slim. If the infection actually reached the onset phase, there will usually be no available method of recovering data. The creator never intended recovery to be possible, and the victim will likely lack the time and expertise required to reverse-engineer the malware, determine HOW it worked, decide whether any recovery is possible and code a fix. The only path forward is to identify all infected machines, quarantine them, then wipe and re-load each infected machine with clean software. If any infected machine remains on the network with clean machines, the infected machine can re-infect newly cleaned machines, getting you nowhere.

For ransomware, victims have to approach recovery with a split brain. On one hand, because it is ransomware, a short-term restoration may be POSSIBLE but only if the victim's leadership and legal counsel can agree upon a ransom amount and only if the attacker's recovery process actually works. If the victim is among the first victims of a new ransomware variant and the recovery process cannot be verified before paying the ransom, the victim may be taking a huge risk. Even if the recovery appears to work, the victim will STILL need to literally wipe and reload EVERY machine touched by the ransomware, whether it triggered on that machine or not. Once a machine has been compromised, it will require a complete reload. This process can still require the victim to incur multiple outages, extended maintenance windows, etc. as the production applications are migrated to new, wiped machines while other infected machines are systematically taken offline, wiped, reloaded and brought back online. And the victim will need to audit every BYTE of affected data to ensure no data was altered intentionally or inadvertently by the ransomware process.

The other half of the victim's split-brain process requires proceeding as though the ransom WON'T be paid or WON'T WORK, which means beginning recovery from backups. At this point, the complexity factor and costs grow exponentially for the victim. No large corporation operates a single monolith application with a single database whose contents reflect the entire state of all customer / employee / vendor relationships at a single point in time. Functions are spread across DOZENS of systems with specific data elements acting as "keys": a record in SystemA links to a record in SystemB, which maps to a different table in SystemB whose key maps SystemB to SystemC, etc. Each of these individual systems may house records for millions of customers over years of activity and may be terabytes in size.

For large systems, the databases housing their records support multiple approaches for "backing up" data in the event of hardware failures, mass deletion (by accident or malice) or corruption. A "full backup" is exactly what it sounds like: a copy of every database table, database index, database sequence, etc. involved with an application, moved to some other storage. If the database is one terabyte in production, that full backup will also take up one terabyte. In most companies, a full backup is created monthly or weekly. An "incremental" backup uses the database's ability to identify all records / changes made AFTER an explicit point in time (a "checkpoint") and copies just those records to a separate set of files tagged with that checkpoint. Incremental backups are typically taken every week or every X days.
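
To make the distinction concrete, here is a minimal Python sketch using a throwaway in-memory table; the schema, column names and checkpoint handling are illustrative assumptions, not any particular vendor's backup tooling.

```python
import sqlite3

# Hypothetical schema: orders(order_id, customer_id, status, last_modified)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1111124, 501, "shipped", "2024-03-10T09:15:00"),
    (1111125, 502, "pending", "2024-03-14T21:40:00"),
])

def full_backup(db):
    # Copy EVERY row -- a 1 TB production database yields ~1 TB of backup data.
    return db.execute("SELECT * FROM orders").fetchall()

def incremental_backup(db, checkpoint):
    # Copy only the rows changed AFTER the checkpoint recorded by the prior backup.
    return db.execute("SELECT * FROM orders WHERE last_modified > ?", (checkpoint,)).fetchall()

print(len(full_backup(conn)))                                  # 2 rows -- the whole table
print(len(incremental_backup(conn, "2024-03-11T00:00:00")))    # 1 row -- changes since the checkpoint
```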

By performing FULL and INCREMENTAL backups, if data is completely lost in production, the newest FULL BACKUP can be restored first, then all INCREMENTAL backups performed AFTER that full backup can be restored atop the full backup to restore the system to a state as close to the present as possible. As an example, a firm making MONTHLY full backups and WEEKLY incremental backups should never lose more than one week of data if they have to restore a corrupted system. Narrowing that potential data loss window involves reducing the intervals between the full and incremental backups but doing that is not pain free or cost free. More frequent backups require more disk storage and more network capacity between the database servers and the SANs housing the storage. If backups are to be copied offsite for additional protection against corruption or acts of nature, the storage and network costs can easily double.
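
A minimal sketch of that restore ordering, assuming backup sets are labeled with the checkpoint timestamps they represent (the labels and the apply step are placeholders for whatever restore tooling is actually in use):

```python
def apply_backup(label: str) -> None:
    # Placeholder for the real restore step (e.g., loading a dump into the database).
    print(f"restoring {label}")

def restore(full_backups: dict[str, str], incrementals: dict[str, str]) -> None:
    newest_full = max(full_backups)               # latest monthly/weekly full backup
    apply_backup(full_backups[newest_full])       # lay down the full backup first
    for checkpoint in sorted(t for t in incrementals if t > newest_full):
        apply_backup(incrementals[checkpoint])    # then replay newer incrementals, oldest first

restore(
    full_backups={"2024-03-01": "full_2024-03-01.bak"},
    incrementals={"2024-03-07": "inc_2024-03-07.bak", "2024-03-14": "inc_2024-03-14.bak"},
)
```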

The REAL complexity with recovery of lost databases lies in the synchronization of data across systems and across those backup images. To illustrate the problem, imagine a company with ten million customers that gains 80,000 customers per month and divides its online commerce, billing, inventory, shipping, agent support and customer support functions across five systems. For each customer, there will be

* customer account numbers
* order numbers
* serial numbers in inventory
* order numbers, shipping numbers and serial numbers in shipping
* trouble ticket numbers in the agent support function
* customer account / login / serial number information in the customer support function

With 80,000 new customers per month, one could imagine daily order activity reaching roughly 2,667 orders. If most of that activity is concentrated over 16 hours, that's about 167 orders per hour. (Multiply these numbers by 1000 if they aren't sufficiently scary.)
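
For reference, the arithmetic behind those figures, assuming a 30-day month and a 16-hour business day:

```python
new_customers_per_month = 80_000
orders_per_day = new_customers_per_month / 30    # ~2,667 orders per day
orders_per_hour = orders_per_day / 16            # ~167 orders per hour
print(round(orders_per_day), round(orders_per_hour))   # 2667 167
```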

Even if the firm is creating backups for these systems religiously on a fixed schedule, there is no way to synchronize the backups to start and complete at EXACTLY the same time. One backup might finish in 20 minutes, another might take 2 hours. When each of those backups completes, they might all reflect data as of 3/14/2024 but their actual content might vary by 100 minutes worth of online orders, etc. If the company becomes a ransomware victim and these systems are restored from full / incremental backups, it is possible for a system that ASSIGNS a key such as "ordernumber" to be BEHIND other systems which reflect ordernumber values used prior to the ransomware corruption. For example,

BILLING: newest values prior to ransomware attack:
* 1111122
* 1111123
* 1111124
* 1111125 <--- the absolute most recent assigned ORDERNUMBER

SHIPPING: newest values seen prior to ransomware attack:
* 1111122
* 1111123
* 1111124
* 1111125 <--- in sync with billing (good)

If these systems are restored from backup and the newest combination of (full + incremental) backup for BILLING winds up two hours BEHIND the newest combination of (full + incremental) backup for SHIPPING, the victim could wind up in this state:

BILLING: newest values after restoration from backup:
* 1110788
* 1110789
* 1110790
* 1110791 <--- missing the 334 values from 1110792 thru 1111125 -- danger

SHIPPING: newest values after restoration from backup:
* 1111122
* 1111123
* 1111124
* 1111125 <--- correct but "ahead" of BILLING by 334 values (not good)

With this content, if the BILLING system isn't "advanced" to use 1111126 as the next assigned ORDERNUMBER before being restored to live use, it will generate NEW orders for 334 ordernumbers that have already been assigned to other customers. This scenario can create two possible problems. At best, it creates an obvious relational data integrity conflict that might cause the application to experience previously unseen errors that block an order from being processed, etc. A far worse scenario is that the applications cannot detect the duplication of ORDERNUMBER values and use them for a second customer. This might allow customers to begin seeing the order contents and customer information of a different customer.

This is one scenario where all of the system backups were nominally targeted to run on the same full and incremental frequencies at approximately the same dates and times. What if March 14, 2024 at 10:00pm was chosen as everyone's restoration target but the incremental backup of System C's database for that date/time is corrupt for unrelated hardware reasons (hey, stuff happens...) and the next most recent backup is from two days prior? Now that data gap could be two days' worth of transactions, posing a much wider opportunity for duplicate keys that will either breach confidentiality or trigger low level database faults in the application, causing more actions to fail for employees and customers.

It is possible to design new systems from scratch to synthesize random values for use in joining these types of records together to avoid this synchronization problem. However, most existing systems built over the last twenty years were not designed with these types of recovery scenarios in mind. They were designed for relational integrity in a perfect world where no one could corrupt the assigned keys and a new true state never had to be re-assembled from out-of-sync backups. Since the systems weren't DESIGNED with these potential problems in mind, few administrators or developers have contemplated the work required to analyze the data in the recovered systems, identify the newest keys present in each, then alter all other databases to skip past those values while the gaps in the older records are filled back in by hand.
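
To make that repair step concrete, here is a minimal sketch, with invented names and toy values, of scanning each restored system for its newest ORDERNUMBER and computing the value every system must skip ahead to before going live:

```python
def next_safe_ordernumber(restored_systems: dict[str, int]) -> int:
    # Each value is the newest ORDERNUMBER found in that system's restored database.
    # Every system must resume assignment ABOVE the global maximum; otherwise a system
    # restored "behind" will re-issue numbers already handed to other customers.
    return max(restored_systems.values()) + 1

restored = {
    "BILLING":  1110791,   # restored from an older full + incremental combination
    "SHIPPING": 1111125,   # restored closer to the moment of the attack
}
print(next_safe_ordernumber(restored))   # 1111126 -- BILLING's sequence must jump ahead to here
```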


Ransomware Mitigation Strategies

AI as an answer? In a word, NO. There may be consulting firms willing to pocket millions in professional services and develop a slide presentation for executives stating an organization has covered its risks from ransomware. However, no honest firm can state confidently that an organization will a) AVOID a ransomware attack or b) RECOVER from it without affecting its mission and incurring exorbitant expenses in recovery. Why?

First, existing security software generalizes PRIOR attack schemes to devise heuristics that are applied to current telemetry to identify system behaviors characteristic of known threats. These systems to date cannot PREDICT a novel combination of NEW infection paths and behaviors to stop a new attack in its tracks before an organization becomes "victim zero" for that attack.

Second, the ultimate meta version of the first argument is the claim that artificial intelligence will be capable of overcoming the limitation of analyzing PRIOR attacks when automating discovery and prevention. Certainly, security firms will begin touting "AI" capabilities in pitches to executives, trying to convince them that AI-based intrusion detection systems will solve the rear-view mirror problem. By definition, applying AI to this problem requires ceding control of an organization's systems TO an AI system. By definition, if the AI system is worth its licensing fees, it has "understanding" of security vulnerabilities and protective measures that humans do NOT have or are incapable of explaining. But if the AI system's "understanding" is beyond human understanding, that same AI system could be compromised in ways which facilitate its exploitation in an attack, and the victim might have even less knowledge of the initial "infection phase" or of ways of recovering if the AI is in fact complicit.

Is this far fetched? ABSOLUTELY NOT. The SolarWinds breach that occurred in late 2020 had nothing to do with AI technology. However, it was an example of a supply-chain based breach. As more network monitoring and security systems are integrated with external data to provide more information to identify new threats, the instances of those systems within companies are trusted to update more of their software AUTOMATICALLY, in REAL TIME, from the vendor's systems. If an attacker wants to attack Companies A, B, C and D but knows they all use AI-based vendor X for intrusion detection, the attacker can instead work to compromise vendor X's system to deliver their "shim" directly to A, B, C and D without raising an eyebrow. Administrators at A, B, C and D can diligently verify they are running the latest software from X, they can verify the digital signatures on that software match those on the vendor's support site, and they'll have zero clue the software from X is compromised. In the case of the SolarWinds attack, the shim had been embedded within a SolarWinds patch that had been installed at many customers for MONTHS prior to it waking up to perform its task.

Air-gapped systems? The value of air-gapped systems is complicated. Creating completely isolated environments to run backup instances of key applications or house the infrastructure managing backups is recommended by many security experts. Conceptually, these environments have zero electronic connectivity (Ethernet or wifi) to "regular" systems, hence the term air gap. This separation is required because of the long potential interval between "infection time" and "onset time." An attacker who is patient or is coordinating a very complicated, large scale attack may stage the infection phase over MONTHS. (Again, in the SolarWinds case, some victims found their systems had been compromised for three to six months.) Once a victim realizes their systems are corrupt, they need a minimum base of servers, networking, file storage and firewall hardware they "know" is not compromised to use to push clean operating system images, application images and database backups back to scrubbed machines. If this "recovery LAN" was exposed to the regular network that is infected, then the "recovery LAN" gear could be infected and the victim has no safe starting point from which to rebuild.

Any organization implementing an air-gapped environment must be realistic about the costs involved to build it and maintain it. An air-gapped environment can be pitched in two ways. One is strictly as a "seed corn" environment -- one sized only to handle the servers and storage required to house backups of databases, backups of server images (easier to do with modern containerization) and the routers, switches and firewalls required to use the "air gap" environment as the starting point to venture back out into the rest of the data center as racks are quarantined, wiped, reloaded and put back online. A second way some organizations are trying to think of an air-gapped environment is as a second form of disaster recovery site -- one housing enough servers, storage, networking and firewall resources to operate the most critical applications in a degraded, non-redundant state. Duplicating hardware for even this reduced set of capabilities is very expensive.

More importantly, regardless of which flavor (seed corn or mini DR) is pursued, this extra air-gap environment is a perpetual expense for hardware, licensing and personnel going forward. The likelihood of most senior management teams agreeing to this large uptick in expense and consistently funding it in future years is near zero. Part of the reason most systems in large corporations exhibit the flaws already described is that the "business owners" who wanted the applications are only willing to support funding when the applications are new and delivering something new of value to "the business." Once that becomes the current state, extracting funding from those "business users" to MAINTAIN that current state and modernize it even if new functionality isn't provided involves rules of logic and persuasion which only function in the Somebody Else's Problem Field. (I will give the reader a moment to search that reference...)

Active / Archive Data Segmentation One important strategy to minimize operational impacts of a ransomware attack and minimize recovery windows is to more explicitly and aggressively segment "active" data needed to process current actions for users / customers from "archival" data used to provide historical context for support or retention for regulatory compliance. An application serving ten million customers over the last five years may have a terabyte of data in total but the space occupied to reflect CURRENT customers and CURRENT pending / active orders might only be five percent of that terabyte. In this scenario, continuing to keep ALL of that data in a single database instance or (worse) in single database TABLES for each object means that if the database is lost or corrupted, a terabyte of data will need to be read from backup and written to a new empty recovery database before ANY of the records are available. The same would be true if the database was lost due to innocent internal failures like a loss of a SAN housing the raw data.

Applications for large numbers of customers can be designed to archive data rows no longer considered "current" to parallel sets of tables in the same database or (preferably) in a separate database so the current data required to handle a new order or support an existing customer with a new problem can be restored in a much shorter period of time. Data considered archival can be treated as read-only and housed on database instances with no insert / update capabilities behind web services that are also read only. Network permissions on those servers can be further restricted to limit any "attack surface" leveraged by ransomware.
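
Here is a minimal sketch of that archival sweep, assuming a hypothetical orders table with status and last_activity columns; the schema, cutoff and separate databases are all illustrative:

```python
import sqlite3
from datetime import datetime, timedelta

SCHEMA = "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT, last_activity TEXT)"
active = sqlite3.connect(":memory:");  active.execute(SCHEMA)
archive = sqlite3.connect(":memory:"); archive.execute(SCHEMA)
active.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1111125, 502, "pending", datetime.now().isoformat()),
    (1050001, 77,  "closed",  "2021-06-01T12:00:00"),
])

def archive_stale_rows(cutoff: datetime) -> None:
    # Move closed orders with no recent activity into the read-only archive database so a
    # restore of the ACTIVE database only has to move the ~5% of data that is still live.
    stale = active.execute(
        "SELECT * FROM orders WHERE status = 'closed' AND last_activity < ?",
        (cutoff.isoformat(),),
    ).fetchall()
    archive.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", stale)
    active.execute("DELETE FROM orders WHERE status = 'closed' AND last_activity < ?", (cutoff.isoformat(),))
    active.commit(); archive.commit()

archive_stale_rows(datetime.now() - timedelta(days=365))
print(active.execute("SELECT COUNT(*) FROM orders").fetchone()[0],    # 1 row left in the active database
      archive.execute("SELECT COUNT(*) FROM orders").fetchone()[0])   # 1 row moved to the archive
```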

From experience, virtually no home-grown or consultant-designed system in Corporate America takes this concept into account. This results in systems that require each "full backup" to clone the ENTIRE database, even when ninety percent of the rows haven't changed in two years. More importantly, when the system is DOWN awaiting recovery, a restoration that should have taken 1-2 hours might easily take 1-2 DAYS due to the I/O bottlenecks of moving a terabyte of data from backup media to the server and SAN. If there are six systems all involved in a firm's "mission critical" function, these delays mean you have ZERO use of the function until ALL of the systems have completed their database restoration.
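
Some rough arithmetic illustrates the gap; the throughput figures below are assumptions for illustration, not measurements from any particular environment:

```python
TB = 1024**4                 # one terabyte (binary) in bytes
healthy   = 300 * 1024**2    # ~300 MB/s: dedicated restore path, sequential reads
contended = 15 * 1024**2     # ~15 MB/s: shared backup appliance, congested SAN fabric

for label, rate in (("healthy path", healthy), ("contended path", contended)):
    hours = TB / rate / 3600
    # ~1 hour vs. ~19 hours, before index rebuilds, integrity checks and validation
    print(f"{label}: {hours:.1f} hours just to move 1 TB")
```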

Designing for Synchronicity Issues No application I encountered in thirty plus years of Corporate America was designed with a coherent view of overall "state" as indicated by key values across the databases involved with the system. Most systems were organically designed to start from id=0 and grow from there as transactions arrived. They were NOT designed to allow startup by looking at a different set of tables that identified a new starting point for ID values of key objects. As a result, recovery scenarios like those above where not all objects could be recovered to the same EXACT point in time create potentially fatal data integrity problems.

Going forward, software architects need to explicitly design systems with a clear "dashboard" of the application's referential integrity, allowing the ID values of all key objects to be tracked as "high water marks" continuously and allowing those high water marks to be adjusted upwards prior to restoring the system to service, to avoid conflicts if some databases had to revert to older points in time with older ID values.
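
A minimal sketch of that kind of dashboard, assuming a periodic job can ask each system for its newest key value (all names and the ledger format are illustrative):

```python
import json, time
from typing import Callable

def record_high_water_marks(systems: dict[str, Callable[[], int]], ledger_path: str) -> None:
    # Append the newest key observed in each system to a ledger kept OUTSIDE the systems
    # themselves (ideally replicated toward the air-gapped environment), so that after a
    # restore the operators know how far forward every sequence must be pushed.
    marks = {name: read_newest_key() for name, read_newest_key in systems.items()}
    marks["recorded_at"] = time.time()
    with open(ledger_path, "a") as ledger:
        ledger.write(json.dumps(marks) + "\n")

record_high_water_marks(
    {"BILLING": lambda: 1111125, "SHIPPING": lambda: 1111125},   # stand-ins for real queries
    "high_water_marks.jsonl",
)
```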

Ideally, for an entirely new system with no integrations to legacy systems, use of any sequential or incrementing ID value in the system's data model should be avoided. Doing so would allow a system to be resurrected after infection / corruption and begin creating new, clean records immediately, without needing to "pick up where it left off" from a point it can no longer accurately determine. This is a very difficult design habit to break since many other business metrics rely on ID values being sequential to provide a quick gauge of activity levels.
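
A minimal sketch of the alternative, where identifiers carry no ordering information at all:

```python
import uuid

def new_order_id() -> str:
    # Random 128-bit identifier: a rebuilt system can mint new IDs immediately with no risk
    # of colliding with IDs issued before the attack, at the cost of losing the
    # "bigger number = newer order" shortcut that some reports and metrics rely on.
    return str(uuid.uuid4())

print(new_order_id())
```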

Integrated Training for Architects, Developers and Administrators The roles of architects, software developers and administrators are highly specialized and often rigidly separated within large organizations. It is safe to say most architects are unfamiliar with the nuts and bolts of managing and restoring database backups across legacy SQL technologies, distributed stores and caches like Cassandra, Redis or memcached, or "big table" based systems like Hadoop and Bigtable. Most developers are not exposed to the entire data model of the application and instead only see small parts in a vacuum without context. Most administrators (database admins and server admins) know operational procedures inside and out but aren't involved at design or coding time to provide insight on model improvements or spot potential problems coming from the design.

Because of these behavioral and organizational silos, very few applications are being designed with the thought of gracefully surviving a massive loss of data and restoring service within hours rather than days or weeks. Allowing for more coordination between these areas of expertise takes additional time, and these teams are already typically understaffed for the types of designs common ten years ago, much less the complexities involved with "modern" frameworks. One way to allot time for these types of discussion is to at least add time to new system projects for analysis of disaster recovery, failover from one environment to another and data model considerations. However, past experience again indicates that attempts to delay the launch of a new app for "ticky dot" stuff are akin to building a dam then asking EVERYONE to wait while engineers sanity test the turbines at the bottom. Even though a misunderstanding of the design or a build flaw MIGHT cause the dam to implode once the water reaches the top, there will always be executives demanding that the floodgates be closed to declare the lake ready for boating as soon as possible. Even if a problem might be found that proves IMPOSSIBLE to solve with a full reservoir.

The More Sobering Takeaway from Ransomware

As mentioned at the beginning, there are few meaningful technical distinctions between the mechanisms malware and ransomware exploit to target victims. The primary difference lies in the motivation of the attacker. As described in the mechanics of these attacks, an INFECTION of a ransomware victim does not always lead to an actual loss of data by that victim. It is possible far more victims have been infected but were never "triggered" because either the attackers didn't see a revenue opportunity that was big enough or they saw a victim so large in size or economic / political importance that the attackers didn't want to attract the law enforcement focus that would result. The economics of ransomware work best on relatively small, technically backward, politically unconnected victims who can pay six or seven figure ransoms and want to stay out of the news.

Ransomware creators have likely internalized another aspect of financial analysis in their strategy. The cost of creating any SINGLE application capable of operating at Fortune 500 scale is typically at least $10 million, and the quality of such applications is NOT good, whether measured by usability, functionality, operational costs or security. The cost of integrating MULTIPLE systems capable of operating at Fortune 500 scale to accomplish some function as a unit can readily approach $30-50 million, and since the quality of the pieces is typically poor, the quality of the integrated system is typically worse.

Leadership is typically convinced that if it costs serious dollars to get crap, it will cost even more to FIX crap and even more to design a system that ISN'T crap from the start. Since current systems for the most part "work" and the company isn't appearing on the front page of the Wall Street Journal, leaders adopt a "Get Shorty" mindset regarding spending money to fix or avoid flaws in their enterprise systems that will only arise once in a blue moon. "What's my motivation?"

Well, as long as they don't get hit, there is no motivation. If they DO get hit but the ransom is only (say) $1 million, they view themselves as sophisticated, rational risk takers and say "I avoided spending $10 million to fix the flaw, I got burned but it only cost me $1 million and a couple of days of disruption? I'm a genius." Frankly, that is the mindset the attackers are betting on. If they started charging $20 or $30 million to return a victim's data, those victims would definitely be rethinking their IT strategy, vulnerabilities would decline somewhat, and fewer companies would pay.

As stated before, however, that mindset of rationalized complacency does NOTHING to protect an organization if the attacker actually wants to damage the company. This is a sobering point because the same attack mechanisms COULD be used at any point for far wider economic, social or military damage. These more drastic outcomes are NOT being avoided because the money spent on existing intrusion detection and mitigation tools is "working," or because large corporations are simply ready for these failures and are curing them quickly / silently as they arise. These more drastic outcomes are not yet happening solely because those with the expertise to initiate them are still choosing to limit the scope of their attacks and in many cases are in it for the money. Businesses and governments are NOT prepared to fend off or recover from the type of damage that can result if these same capabilities are leveraged more widely for sheer destruction. In a data-driven world, mass destruction of data can directly cause mass destruction and disruption at a scale not previously contemplated. Organizations who state with confidence they are ready for what comes their way really aren't grasping the full picture and are likely in denial about the survivability of their systems.


WTH

11 Likes

For medical-related services this is life and death at larger population scales. There needs to be regulation. The reason medical institutions are attacked is precisely because the imperative is life and death.

2 Likes

WTH,

I hope this is in keeping with the topic. It is the other side of the discussion. Bruce Schneier’s newsletter is free for the public to disseminate. This essay by Josephine Wolff is meant to be publicly disseminated as well.

Without further comment by me.

A Cyber Insurance Backstop

[2024.02.28] In the first week of January, the pharmaceutical giant Merck quietly settled its years-long lawsuit over whether or not its property and casualty insurers would cover a $700 million claim filed after the devastating NotPetya cyberattack in 2017. The malware ultimately infected more than 40,000 of Merck’s computers, which significantly disrupted the company’s drug and vaccine production. After Merck filed its $700 million claim, the pharmaceutical giant’s insurers argued that they were not required to cover the malware’s damage because the cyberattack was widely attributed to the Russian government and therefore was excluded from standard property and casualty insurance coverage as a “hostile or warlike act.”

At the heart of the lawsuit was a crucial question: Who should pay for massive, state-sponsored cyberattacks that cause billions of dollars’ worth of damage?

One possible solution, touted by former Department of Homeland Security Secretary Michael Chertoff on a recent podcast, would be for the federal government to step in and help pay for these sorts of attacks by providing a cyber insurance backstop. A cyber insurance backstop would provide a means for insurers to receive financial support from the federal government in the event that there was a catastrophic cyberattack that caused so much financial damage that the insurers could not afford to cover all of it.

In his discussion of a potential backstop, Chertoff specifically references the Terrorism Risk Insurance Act (TRIA) as a model. TRIA was passed in 2002 to provide financial assistance to the insurers who were reeling from covering the costs of the Sept. 11, 2001, terrorist attacks. It also created the Terrorism Risk Insurance Program (TRIP), a public-private system of compensation for some terrorism insurance claims. The 9/11 attacks cost insurers and reinsurers $47 billion. It was one of the most expensive insured events in history and prompted many insurers to stop offering terrorism coverage, while others raised the premiums for such policies significantly, making them prohibitively expensive for many businesses. The government passed TRIA to provide support for insurers in the event of another terrorist attack, so that they would be willing to offer terrorism coverage again at reasonable rates. President Biden’s 2023 National Cybersecurity Strategy tasked the Treasury and Homeland Security Departments with investigating possible ways of implementing something similar for large cyberattacks.

There is a growing (and unsurprising) consensus among insurers in favor of the creation and implementation of a federal cyber insurance backstop. Like terrorist attacks, catastrophic cyberattacks are difficult for insurers to predict or model because there is not very good historical data about them – and even if there were, it’s not clear that past patterns of cyberattacks will dictate future ones. What’s more, cyberattacks could cost insurers astronomic sums of money, especially if all of their policyholders were simultaneously affected by the same attack. However, despite this consensus and the fact that this idea of the government acting as the “insurer of last resort” was first floated more than a decade ago, actually developing a sound, thorough proposal for a backstop has proved to be much more challenging than many insurers and policymakers anticipated.

One major point of issue is determining a threshold for what types of cyberattacks should trigger a backstop. Specific characteristics of cyberattacks – such as who perpetrated the attack, the motive behind it, and total damage it has caused – are often exceedingly difficult to determine. Therefore, even if policymakers could agree on what types of attacks they think the government should pay for based on these characteristics, they likely won’t be able to calculate which incursions actually qualify for assistance.

For instance, NotPetya is estimated to have caused more than $10 billion in damage worldwide, but the quantifiable amount of damage it actually did is unknown. The attack caused such a wide variety of disruptions in so many different industries, many of which likely went unreported since many companies had no incentive to publicize their security failings and were not required to do so. Observers do, however, have a pretty good idea who was behind the NotPetya attack because several governments, including the United States and the United Kingdom, issued coordinated statements blaming the Russian military. As for the motive behind NotPetya, the program was initially transmitted through Ukrainian accounting software, which suggests that it was intended to target Ukrainian critical infrastructure. But notably, this type of coordinated, consensus-based attribution to a specific government is relatively rare when it comes to cyberattacks. Future attacks are not likely to receive the same determination.

In the absence of a government backstop, the insurance industry has begun to carve out larger and larger exceptions to their standard cyber coverage. For example, in a pair of rulings against Merck’s insurers, judges in New Jersey ruled that the insurance exclusions for “hostile or warlike acts” (such as the one in Merck’s property policy that excluded coverage for “loss or damage caused by hostile or warlike action in time of peace or war by any government or sovereign power”) were not sufficiently specific to encompass a cyberattack such as NotPetya that did not involve the use of traditional force.

Accordingly, insurers such as Lloyd’s have begun to change their policy language to explicitly exclude broad swaths of cyberattacks that are perpetrated by nation-states. In an August 2022 bulletin, Lloyd’s instructed its underwriters to exclude from all cyber insurance policies not just losses arising from war but also “losses arising from state backed cyber-attacks that (a) significantly impair the ability of a state to function or (b) that significantly impair the security capabilities of a state.” Other insurers, such as Chubb, have tried to avoid tricky questions about attribution by suggesting a government response-based exclusion for war that only applies if a government responds to a cyberattack by authorizing the use of force. Chubb has also introduced explicit definitions for cyberattacks that pose a “systemic risk” or impact multiple entities simultaneously. But most of this language has not yet been tested by insurers trying to deny claims. No one, including the companies buying the policies with these exclusions written into them, really knows exactly which types of cyberattacks they exclude. It’s not clear what types of cyberattacks courts will recognize as being state-sponsored, or posing systemic risks, or significantly impairing the ability of a state to function. And for the policyholders whose insurance exclusions feature this sort of language, it matters a great deal how that language in their exclusions will be parsed and understood by courts adjudicating claim disputes.

These types of recent exclusions leave a large hole in companies’ coverage for cyber risks, placing even more pressure on the government to help. One of the reasons Chertoff gives for why the backstop is important is to help clarify for organizations what cyber risk-related costs they are and are not responsible for. That clarity will require very specific definitions of what types of cyberattacks the government will and will not pay for. And as the insurers know, it can be quite difficult to anticipate what the next catastrophic cyberattack will look like or how to craft a policy that will enable the government to pay only for a narrow slice of cyberattacks in a varied and unpredictable threat landscape. Get this wrong, and the government will end up writing some very large checks.

And in comparison to insurers’ coverage of terrorist attacks, large-scale cyberattacks are much more common and affect far more organizations, which makes it a far more costly risk that no one wants to take on. Organizations don’t want to – that’s why they buy insurance. Insurance companies don’t want to – that’s why they look to the government for assistance. But, so far, the U.S. government doesn’t want to take on the risk, either.

It is safe to assume, however, that regardless of whether a formal backstop is established, the federal government would step in and help pay for a sufficiently catastrophic cyberattack. If the electric grid went down nationwide, for instance, the U.S. government would certainly help cover the resulting costs. It’s possible to imagine any number of catastrophic scenarios in which an ad hoc backstop would be implemented hastily to help address massive costs and catastrophic damage, but that’s not primarily what insurers and their policyholders are looking for. They want some reassurance and clarity up front about what types of incidents the government will help pay for. But to provide that kind of promise in advance, the government likely would have to pair it with some security requirements, such as implementing multifactor authentication, strong encryption, or intrusion detection systems. Otherwise, they create a moral hazard problem, where companies may decide they can invest less in security knowing that the government will bail them out if they are the victims of a really expensive attack.

The U.S. government has been looking into the issue for a while, though, even before the 2023 National Cybersecurity Strategy was released. In 2022, for instance, the Federal Insurance Office in the Treasury Department published a Request for Comment on a “Potential Federal Insurance Response to Catastrophic Cyber Incidents.” The responses recommended a variety of different possible backstop models, ranging from expanding TRIP to encompass certain catastrophic cyber incidents, to creating a new structure similar to the National Flood Insurance Program that helps underwrite flood insurance, to trying a public-private partnership backstop model similar to the United Kingdom’s Pool Re program.

Many of these responses rightly noted that while it might eventually make sense to have some federal backstop, implementing such a program immediately might be premature. University of Edinburgh Professor Daniel Woods, for example, made a compelling case for why it was too soon to institute a backstop in Lawfare last year. Woods wrote,

One might argue similarly that a cyber insurance backstop would subsidize those companies whose security posture creates the potential for cyber catastrophe, such as the NotPetya attack that caused $10 billion in damage. Infection in this instance could have been prevented by basic cyber hygiene. Why should companies that do not employ basic cyber hygiene be subsidized by industry peers? The argument is even less clear for a taxpayer-funded subsidy.

The answer is to ensure that a backstop applies only to companies that follow basic cyber hygiene guidelines, or to insurers who require those hygiene measures of their policyholders. These are the types of controls many are familiar with: complicated passwords, app-based two-factor authentication, antivirus programs, and warning labels on emails. But this is easier said than done. To a surprising extent, it is difficult to know which security controls really work to improve companies’ cybersecurity. Scholars know what they think works: strong encryption, multifactor authentication, regular software updates, and automated backups. But there is not anywhere near as much empirical evidence as there ought to be about how effective these measures are in different implementations, or how much they reduce a company’s exposure to cyber risk.

This is largely due to companies’ reluctance to share detailed, quantitative information about cybersecurity incidents because any such information may be used to criticize their security posture or, even worse, as evidence for a government investigation or class-action lawsuit. And when insurers and regulators alike try to gather that data, they often run into legal roadblocks because these investigations are often run by lawyers who claim that the results are shielded by attorney-client privilege or work product doctrine. In some cases, companies don’t write down their findings at all to avoid the possibility of its being used against them in court. Without this data, it’s difficult for insurers to be confident that what they’re requiring of their policyholders will really work to improve those policyholders’ security and decrease their claims for cybersecurity-related incidents under their policies. Similarly, it’s hard for the federal government to be confident that they can impose requirements for a backstop that will actually raise the level of cybersecurity hygiene nationwide.

The key to managing cyber risks – both large and small – and designing a cyber backstop is determining what security practices can effectively mitigate the impact of these attacks. If there were data showing which controls work, insurers could then require that their policyholders use them, in the same way they require policyholders to install smoke detectors or burglar alarms. Similarly, if the government had better data about which security tools actually work, it could establish a backstop that applied only to victims who have used those tools as safeguards. The goal of this effort, of course, is to improve organizations’ overall cybersecurity in addition to providing financial assistance.

There are a number of ways this data could be collected. Insurers could do it through their claims databases and then aggregate that data across carriers to policymakers. They did this for car safety measures starting in the 1950s, when a group of insurance associations founded the Insurance Institute for Highway Safety. The government could use its increasing reporting authorities, for instance under the Cyber Incident Reporting for Critical Infrastructure Act of 2022, to require that companies report data about cybersecurity incidents, including which countermeasures were in place and the root causes of the incidents. Or the government could establish an entirely new entity in the form of a Bureau for Cyber Statistics that would be devoted to collecting and analyzing this type of data.

Scholars and policymakers can’t design a cyber backstop until this data is collected and studied to determine what works best for cybersecurity. More broadly, organizations’ cybersecurity cannot improve until more is known about the threat landscape and the most effective tools for managing cyber risk.

If the cybersecurity community doesn’t pause to gather that data first, then it will never be able to meaningfully strengthen companies’ security postures against large-scale cyberattacks, and insurers and government officials will just keep passing the buck back and forth, while the victims are left to pay for those attacks themselves.

This essay was written with Josephine Wolff, and was originally published in Lawfare.

3 Likes

lol, after reading this, I might bump up by 1% my cyber security companies in the port.

1 Like

This concept is very on-topic for this problem. I honestly don’t know, or even think I know, the right answer on the wisdom / value of establishing this type of cyber insurance backstop mechanism. It would be equivalent to the Pension Benefit Guaranty Corporation, which provides a final pool of money to tap to pay out pensions to employees of firms which go bankrupt and cannot meet their pension obligations to retirees.

On one hand, the goal is to protect the larger public by providing a means to compensate them for harm stemming from a force majeure level cybersecurity hack that a court could adjudicate was avoidable had an organization taken appropriate avoidance measures. On the other hand, the complexity of modern systems and their degree of integration produces a “butterfly in China” effect where it becomes virtually impossible to predict how a buffer overflow flaw in a wireless keyboard driver could lead to the infection of 100,000 PCs at a Fortune 500 company operating chemical plants, which leads to intentional sabotage of control systems at an ammonia plant, etc., etc.

Yet, those are exactly the stakes.

Perhaps such a cybersecurity backstop insurance plan is required in conjunction with something else I didn’t think to address… The creation of formal engineering standards for software systems and establishment of a true Software Engineering credential that is the equivalent of a Professional Engineer designation. It is illegal to erect a public building or instance of public infrastructure (bridge, tunnel, dam) without a certified Professional Engineer signing off on building specifications to ensure known standards reflecting decades / centuries of learning about beams, spans, compression strength, tensile strength, etc. are incorporated into a design. Those standards don’t merely require sign-off, they require the PE to “show their work” and show how specifications of the approved design were subjected to calculations to confirm every beam could handle every load with the required minimum margin of safety.

The software industry certainly creates reference materials on “design patterns” with descriptions of the problems they solve and how they function but there are no regulations requiring organizations to follow such patterns nor are there regulations requiring an organization to employ one or more qualified engineers with certification in such patterns to control application development and operations. In general, software is written by five or ten core developers pounding it out at the user interface, web service and core DB layers until it approximately meets functional requirements and can stay running in a test environment for a couple of days under a load test.

I’m a BSEE and do not have a Professional Engineer certification. If you’ve ever spoken with a PE, it is usually obvious within seconds you are dealing with a mind operating on a higher plane of problem solving. In contrast, if you have ever spoken to someone who designs, codes or tests software, you would find their expertise to be WIDELY variable and with only a passing correlation to their pay level. The software and networking industries are awash with various “certifications” pushed by vendors as a means of promoting their products and frameworks. Networkers can pursue certifications like CCIE and CCNA from Cisco, Network+ or Security+ from CompTIA, etc. Developers can get certifications from AWS or Google in their cloud products. Even project managers have certifications in requirements gathering, Agile development (which I personally think is a cult…), etc. NONE of these certifications require or provide the level of holistic understanding of system design mentioned earlier. They are too siloed.

I’m not sure what a viable solution is. It seems apparent whatever organizations are doing now is NOT sufficient.

WTH

4 Likes

I do not know your age. If you have not heard it this may be a shock. You are an engineer. There is no “solution”. There is always another problem. LOL

If you find no solution to a problem, it is time not to find that a problem. (this might be dating advice)

I am a part engineer with some training, a larger part artist, and a larger part economist. When I see some discussions here go into endless stuff that we are wasting time with I am my father’s son and want to kick ahats.

My best friend is a retired engineer. He has your mindset. I do get it but the artist in me works around that.

That said the “evolution” as opposed to the “solution” is the government insurance. Because then the government will be able to regulate the premiums. Between the government and the corporate world, it will be less expensive to create regulations and have the systems you suggested built by tech companies and maintained by the private and public sectors.

The process will also get simplified because things are in the “cloud”. A major corporation might spring up working with AWS, Azure, and Google Cloud Services. This would bring down costs and bring smaller companies along, bringing down costs further.

We have left the supply-side period and honesty is seeping back into American life.

1 Like

Seems to me that using tax dollars to reduce the risk to private business will also reduce the incentive for private business to solve the problem.

Just my nonexpert opinion but I don’t think this problem is resolvable in the near future. I think institutions are going to have to decentralize, simplify, and compartmentalize their databases to create less attractive targets that are easier to secure. And yes, I do believe there will be increasing use of paper files of some sort, the ultimate air-gapped offline storage.

3 Likes

In the past (25-30 years ago), companies had private networks within the company that were not Internet connected… Specifically so the company could directly control access.

Yes, this is sort of an “air gap” approach?

Surely something like this could be imagined and realized?

Especially for something like healthcare?
For a business with constant new orders, it might be more complex.

I’m imagining something that is another barrier to infection. I don’t see it being perfect… Just better than the current status quo.

I read this thread, and saw the “an attack is implemented over months” info.
I assume this would have to be part of the design.

:face_with_monocle:
ralph thanks you for your time, energy, effort in doing these deep dives!

1 Like

I’m not sure there is an absolute solution. This is evolutionary; offense and defense each making advances in an endless cycle.

The best action may involve the “Sneaker Solution.”

Two hikers were in the wild and noticed a large bear running after them. Hiker 1’s reaction was to start running. He turned his head to see the other hiker taking off his boots and putting sneakers on. Hiker 1 yelled “Don’t be a fool, you can’t outrun a bear!” Hiker 2 replied “ I don’t have to outrun the bear, I just have to outrun you.”

The low hanging fruit gets picked first.

1 Like

Seems to me that using tax dollars to reduce the risk to private business will also reduce the incentive for private business to solve the problem.

Yup. The classic economics definition of a moral hazard. The process by which providing protection from the cost of undesired behavior results in encouraging MORE of that very behavior by reducing its pain.

In the past (25-30 years ago), companies had private networks within the company that were not Internet connected… Specifically so the company could directly control access.

This practice is something called network segmentation, paired with an administrative practice called “least privilege.” A typical commerce web site doesn’t run all of the layers (user interface, web services, partner integrations, database) in a single LAN segment. If it did, and the user interface exposed to the web had a security flaw, an external party could hack the UI layer, escalate to root permissions, and then have root access to the other functions closer to the core database and external partners to exploit.

Instead, each layer is put in its own LAN segment with firewalls BETWEEN each LAN, somewhat like having a vault inside a vault inside a vault. Servers inside the outermost tier are only allowed to accept connections from the outside and are only allowed to make connections to web services inside the next layer in and to a minimum set of management functions (DNS, NTP, network monitoring tools).
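To make the vault-inside-a-vault idea concrete, here is a minimal sketch of such a default-deny allow-list. The tier names, port numbers, and the `is_allowed` helper are all hypothetical illustrations, not any particular vendor’s firewall syntax:

```python
# Minimal sketch of a tiered, default-deny policy: any (source, destination,
# port) combination not explicitly listed is refused. Names/ports are made up.

ALLOWED_FLOWS = {
    ("internet", "web_tier", 443),     # outside world may only reach the UI tier on HTTPS
    ("web_tier", "app_tier", 8443),    # UI tier may only call the web-services tier
    ("app_tier", "db_tier", 5432),     # services tier may only query the database tier
    # every tier may reach shared management services (DNS, NTP)
    ("web_tier", "mgmt", 53), ("app_tier", "mgmt", 53), ("db_tier", "mgmt", 53),
    ("web_tier", "mgmt", 123), ("app_tier", "mgmt", 123), ("db_tier", "mgmt", 123),
}

def is_allowed(src: str, dst: str, port: int) -> bool:
    """Default-deny check: a flow is permitted only if explicitly listed."""
    return (src, dst, port) in ALLOWED_FLOWS

print(is_allowed("internet", "web_tier", 443))   # True  - normal customer traffic
print(is_allowed("web_tier", "db_tier", 5432))   # False - a hacked UI box can't jump straight to the DB
```

The point of the sketch is the shape of the policy: each inner vault only trusts the layer immediately outside it, and everything else is denied by default.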

This is useful in minimizing risk from EXTERNAL actors who are literally originating their intrusion from the outside. However, it doesn’t protect the interior segments from INSIDE threats (disgruntled employees wanting to steal data before quitting, etc.), nor does it protect the system against an internal threat stemming from trusting code provided from external sources. As mentioned in the thread, software is so complex that many systems pull in updates from their maker nearly continuously, much like the AV software on your home computer pulls updates nearly every HOUR because threats are evolving that rapidly. But when you install that software, you are TRUSTING your AV provider to not be the threat (intentionally or unintentionally…).

If an interior LAN is compromised via this “supply chain” approach, then those other LAN segments are all vulnerable since they are all administered via privileges which typically trust LAN segments “further inside.” That doesn’t mean a company with 50,000 employees allows the desktop LAN to reach inside the DB or WEB service tiers of its data centers but any compromise of an administrative LAN in a data center would essentially expose every OTHER LAN segment in the data center with potentially valuable data.

This problem is arguably worse when nearly every DB segment in a data center has common connectivity either to a “backup LAN” housing all of the hardware managing backups or to the organization’s internal “data warehouse” where data from all kinds of systems is brought together for bulk analytics and mining.

In some sense, companies need to decide what is more important…

a) Retaining terabytes of data about user selections in a portal application from three years ago on the CHANCE someone might mine some interesting insight about that customer’s willingness to buy a product two years from now, or

b) Minimizing retained data to minimize the incentive for someone to attempt an external data heist and to minimize the volume of data needing to be restored in a physical or security disaster scenario?

"I don’t have to outrun the bear, I just have to outrun you.”

Much of the spending on security software and firewalls is based on this very concept. Every building needing protection has SOME flaws. However, thieves are often lazy and prefer the easy mark over the better defended mark. For these thieves, you only need to be better defended than someone else to encourage the thief to try somewhere else. For large companies, their firewall isn’t immune to DDOS attacks, but $2 million in A/B redundant firewall gear can suppress more of a DDOS attack than $750k in firewall gear, so the firm with $2 million in firewall protection stays online and the company with only $750k in firewall gear gets taken down for a few hours.
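As a toy illustration of that arithmetic (every throughput number below is invented; real DDOS capacity planning is far more involved), the firm whose scrubbing capacity exceeds the attack volume rides it out while the other goes dark:

```python
# Toy model: a volumetric DDOS only takes a site down when the attack traffic
# exceeds what the firewall/scrubbing gear can absorb. All figures are invented.

ATTACK_GBPS = 120   # hypothetical attack volume

firms = {
    "firm_with_$2M_firewalls":   150,   # Gbps of scrubbing capacity
    "firm_with_$750k_firewalls":  60,
}

for name, capacity_gbps in firms.items():
    status = "stays online" if capacity_gbps >= ATTACK_GBPS else "knocked offline"
    print(f"{name}: {capacity_gbps} Gbps capacity vs {ATTACK_GBPS} Gbps attack -> {status}")
```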

WTH

4 Likes

Except it would not be tax dollars. There would be premiums to pay.

1 Like

I wonder if one possible “solution” (nothing’s ever a complete solution) is not to focus on the attacks, but to focus on the payments. The handoff. For ransomware (as opposed to malware generally), the black hats want to get paid. A big part of cutting down on organized crime and drug cartels was always to make it as difficult as possible to launder the ill-gotten gains.

The nuclear tactic there would be to ban cryptocurrencies. Ownership and use. If the hackers can’t denominate their ransom in crypto, it gets real hard for them to arrange a payment. Not impossible - ransomware pre-dates crypto. But crypto makes ransomware a whole lot easier. If U.S. companies can’t pay ransoms in crypto, it makes U.S. companies a lot less profitable as targets for ransomware.

The other possibility is to focus a lot more resources - a lot - on tracking the black hats down through the payment side and making it as unprofitable as possible to target U.S. companies. Again, almost all ransomware attacks these days provide for payment in crypto - and the unique characteristics of crypto make enforcement on the payments side possible, so long as sufficient resources are put into trying to go after them.

3 Likes

Yup. I think I read a statistic (maybe it was an “Irish fact” - a fact that sounds so good and helps the story so much it HAS to be true) that said virtually ALL ransom payoffs are via cryptocurrency since attackers feel they have more anonymity with cryptocurrency than traditional banking accounts.

This is kind of amusing because cryptocurrency is just DATA, and while the initial reference to a cryptocoin may appear anonymous cuz it doesn’t come with a tag that says “Owner = Vladimir Putin”, the actual DATA of that “coin” in the blockchain IS unique and immutable. Anywhere that data goes provides a tie to some actor tied to that coin. NOTHING is invisible or untraceable.
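As a rough sketch of why the data itself is the trail (a toy ledger with made-up addresses and amounts, not any real chain’s transaction format), simply walking the payment graph from a known ransom address surfaces every downstream address the funds ever touched, including the exchange where someone eventually tries to cash out:

```python
# Toy public ledger: every transaction permanently records which address paid
# which address. Addresses and amounts are invented for illustration.
from collections import deque

LEDGER = [
    {"from": "ransom_addr_1", "to": "hop_A",        "amount": 90.0},
    {"from": "hop_A",         "to": "hop_B",        "amount": 45.0},
    {"from": "hop_A",         "to": "hop_C",        "amount": 45.0},
    {"from": "hop_B",         "to": "exchange_XYZ", "amount": 44.9},
]

def trace_downstream(start_addr: str) -> set:
    """Breadth-first walk of the payment graph starting from a known ransom address."""
    seen, queue = set(), deque([start_addr])
    while queue:
        addr = queue.popleft()
        for tx in LEDGER:
            if tx["from"] == addr and tx["to"] not in seen:
                seen.add(tx["to"])
                queue.append(tx["to"])
    return seen

print(trace_downstream("ransom_addr_1"))   # {'hop_A', 'hop_B', 'hop_C', 'exchange_XYZ'}
```

Tumblers (mentioned below) complicate that walk by mixing many such graphs together, but the edges themselves never disappear from the ledger.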

Law enforcement agencies across the globe need to become more literate in cryptocurrency technologies to chase down these gangs, halt the payoffs, and cut off the incentive for these attacks. Of course, that still leaves terrorists and state actors who aren’t acting on profit motives.

WTH

2 Likes

Hence the development of crypto tumblers. You’re absolutely right that the benefit of financial transfers without intermediaries (like banks or other financial institutions) is offset a little by the data trail. But if you use the crypto version of money laundering, you can cover your tracks a fair amount.

3 Likes

One simple way to substantially reduce the probability of getting hit with ransomware, at least in the near future, is to use Macs. With only a fraction of Windows’ share of the enterprise market, Macs have the same advantage as the small skinny turkey over the big fat one in the days approaching Thanksgiving.

1 Like

Re the problem of backups across multiple systems which are not in sync, one option is the use of after-imaging, in which the record of a change in the database is captured as a part of the transaction. If the after-image files are moved regularly to the system with the recovery database(s) and applied to those databases, the recovery copy can be up to date within minutes of the active database. There is a risk of the attack preventing the safe transfer of the latest after-image files, but this is small compared to the issues with full and incremental backups, and the recovery system is ready to be switched into active use with no further processing as soon as it is safe.
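A minimal sketch of that approach, assuming a simplified key/value store and a made-up after-image record format (this is essentially log shipping: the standby replays each image in order, so it lags the active database only by the transfer interval):

```python
# Minimal sketch of after-imaging / log shipping: each committed change is
# captured as an "after image" and replayed, in order, on a standby copy.
# The record format and databases here are hypothetical stand-ins.

active_db = {}
standby_db = {}
after_image_log = []          # list of (sequence, key, new_value) tuples

def commit(key, new_value):
    """Apply a change to the active database and append its after-image."""
    active_db[key] = new_value
    after_image_log.append((len(after_image_log), key, new_value))

def ship_and_apply(last_applied):
    """Periodically move new after-images to the standby and replay them."""
    for seq, key, value in after_image_log[last_applied:]:
        standby_db[key] = value
        last_applied = seq + 1
    return last_applied

applied = 0
commit("claim:1001", {"status": "approved", "amount": 1250})
commit("claim:1002", {"status": "pending",  "amount": 400})
applied = ship_and_apply(applied)   # standby is now current through the last shipped image
assert standby_db == active_db
```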

3 Likes

The air gap approach is worthwhile, yet overrated.

The uranium-enrichment centrifuges in Iran were air-gapped, yet bad actors (the U.S. and Israel) found a way in. Admittedly this was an unusual situation, but it could be replicated pretty easily for a health care system.

At Westinghouse (in the ’80s) we ran a proprietary financial system, all hardwired back to a mother ship in Connecticut; there was no way to “break in”. But a health care system requires terminals in every doctor’s office, every hospital corridor, every ER, urgent care, X-ray room, and, well, everywhere. Even if it never connected to the (so-called) internet, it would be trivial to jump an air gap by simple distribution of flash drives or other means. And it’s just not practical to go back to proprietary, hardwired systems for enterprises this large. (How would Schwab conduct business with customers, for example?)

So it’s a dilemma, and I don’t see “insurance” as a viable option, but perhaps I’m not imagining it correctly. I see a lot of moral hazard in giving sysops an easy out, and I don’t think a ban on crypto is practical either, simply because cryptocurrencies are so easy to create (there are hundreds of them now).

No answers here, only hand wringing. Sorry.

2 Likes

The purpose of the ban isn’t to prevent the creation of crypto (though it would certainly stop a lot of that if a lot of countries did it). It’s to prevent the ransom targets from being able to pay in crypto. If U.S. companies cannot legally acquire crypto (of any kind), then they won’t be able to pay ransoms in crypto without breaking the law. That puts a barrier in the business model of the hackers - they gain nothing from hitting a company that can’t pay them.

Sure, a few small privately held companies might be willing to break the law - but most executives of most companies won’t do it. They can’t use company funds to buy crypto without there being a huge paper trail - and if that paper trail puts them in jail personally, they’re just going to tell the hackers they can’t pay the ransom that way. Even if lots of cryptocurrencies exist, few executives who have to answer to an audit are realistically going to use them to pay a ransom.

1 Like

The world operated before the internet. Things were slower and more expensive, but the world did operate. Remember the engineer’s dilemma, “fast, cheap, good, pick two”? The “JCs” have chosen fast and cheap, gambling that the shortcuts, like MCAS in a 737 MAX, will not bite them.

Steve

1 Like

They will contract with a security agency (an overseas company where crypto is legal) for a $10,100,000 fee. The security agency will keep $100,000 as their fee for their services and will use the $10,000,000 to buy crypto to transfer to the hackers. The contract will not mention HOW the security agency will fix the problem, just that they WILL fix it.

1 Like