Murphy's Law and the lessons of the CrowdStrike outage


"Anything that can go wrong will go wrong, and at the worst possible time." — Edward A. Murphy Jr., American aerospace engineer 

The CrowdStrike outage of July 19, 2024 has already been covered in a thousand news stories and a hundred memes. The short version: Cybersecurity vendor CrowdStrike issued a minor but flawed configuration update to its EDR product that caused 8.5 million Microsoft Windows systems to crash. Rebooting from the “blue screen of death” caused by the bug merely resulted in another blue screen. The outage crippled airlines, hospitals, banks, TV broadcasters and other businesses around the world, affecting millions of travelers, patients and consumers, not to mention the tens of thousands of IT pros who had to spend their weekend manually applying a fix to each affected computer to revive it.

We come here neither to criticize CrowdStrike nor to conduct a detailed forensic analysis of the mistakes that led to the fiasco. The incident was the consequence of the kind of human errors and technology failures that could befall any tech vendor, and that will afflict other tech vendors and businesses regularly in the coming months and years. Rather, we want to identify some lessons from the outage that can help your business better defend itself against such incidents and be better prepared to recover when defensive measures fail, as they eventually and inevitably will.

CrowdStrike-like incidents WILL happen again

That word “inevitably” is the crux of a key lesson here: recognizing that such incidents are going to occur despite the best efforts of our colleagues, partners, suppliers, governments, regulators and crime-fighting agencies. To paraphrase Murphy, “Stuff happens.” Cybercriminals launch over a quarter of a million new malware instances every single day. Well-meaning employees make mistakes. Software bugs slip out into the world undetected. Hardware components wear out and fail. Mother Nature throws hurricanes, wildfires, blizzards and floods at us. Suffering an outage sooner or later is a certainty as reliable as the tides.

The risk management professionals in our ranks intuitively understand this reality. They are a key driver in some parallel developments that have recently emerged from three distinct directions: regulatory authorities, cybersecurity standards developers, and the insurance industry. Consider:

  • Brand-new compliance standards like the EU’s Digital Operational Resilience Act (DORA) and revisions to existing ones like the EU’s Network and Information Systems Directive (EU) 2022/2555 (NIS 2).
  • New versions of existing cybersecurity standards like the National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) Version 2.0, also known as NIST CSF 2.0.
  • Evolving insurability standards for businesses to qualify for cyber insurance policies.

Each of these has historically placed a strong emphasis on cybersecurity defenses like endpoint protection, strong authentication, and security awareness training. But in the last year or two, they have added a much stronger emphasis on recovery, built on pillars like backup, disaster recovery and incident response planning. This reflects a broader recognition that true cyber resilience requires both defense and recovery.

This is not news to Acronis; we came to this concept from the other direction, starting as a backup vendor 20 years ago, and introducing cybersecurity natively integrated with backup eight years ago. We have long believed that the combination of defense and recovery now being promoted by regulators, standards bodies and insurers is essential to preserving a business’s uptime and data integrity.

How businesses should respond to the CrowdStrike incident

So, if you are a cybersecurity, IT operations or risk management leader building a business case for a recovery infrastructure upgrade — say, investing in cloud disaster recovery services for the first time, or formalizing an incident response plan — try this line: “Building better cyber resilience will not only boost our chances of avoiding a painful incident like the CrowdStrike outage, but will improve our compliance posture, align us better with industry best practices outlined by cybersecurity frameworks, and improve our ability to qualify for competitively priced cyber insurance.”

That’s the big-picture takeaway from this particular example of Murphy’s truism that, despite our best efforts, things occasionally go utterly off the rails. We’d be remiss as a tech vendor if we didn’t also offer some tactical advice on how to respond to the CrowdStrike outage, specifically:

  • Establish a process for testing updates in a protected “sandbox” environment before rolling them out to all systems. This may require disabling automatic updates and rolling out the updates only after testing. Where such controls are not available (as with CrowdStrike prior to this incident), focus on rollback capabilities instead.
  • Maintain up-to-date backups and implement rollback and recovery procedures to mitigate potential system or application upgrade issues.
  • Enable system administrators to initiate recovery remotely for all affected systems that are still operational.
  • Provide remote users with clear guidance for manual recovery of systems that are so impaired they cannot be remotely remediated.
  • Consider deploying features like Acronis One-Click Recovery, which enables employees with no IT skills to restore their own systems simply and quickly.
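
For teams that script their own update process, the first two recommendations above (test in a sandbox first, and keep a rollback path) can be sketched roughly as follows. This is a minimal illustration, not a description of any vendor's tooling: the function `staged_rollout`, the ring names, and the three callbacks (`apply_update`, `is_healthy`, `roll_back`) are all hypothetical placeholders that you would wire to your own deployment and backup tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    completed_rings: List[str] = field(default_factory=list)  # rings that passed health checks
    rolled_back: List[str] = field(default_factory=list)      # hosts reverted after a failure
    halted_at: str = ""                                       # ring where the rollout stopped, if any


def staged_rollout(
    rings: Dict[str, List[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> RolloutResult:
    """Apply an update one ring at a time (hypothetical sketch).

    Rings are processed in insertion order, e.g. sandbox -> pilot -> everyone.
    If any host in a ring is unhealthy after the update, every host updated in
    that ring is rolled back and later rings are never touched.
    """
    result = RolloutResult()
    for ring_name, hosts in rings.items():
        updated = []
        for host in hosts:
            apply_update(host)
            updated.append(host)

        # Check health only after the whole ring has been updated.
        failed = [host for host in updated if not is_healthy(host)]
        if failed:
            for host in updated:
                roll_back(host)
                result.rolled_back.append(host)
            result.halted_at = ring_name
            return result

        result.completed_rings.append(ring_name)
    return result


if __name__ == "__main__":
    # Simulated fleet: one pilot machine "crashes" after the update.
    fleet_state: Dict[str, str] = {}
    outcome = staged_rollout(
        rings={
            "sandbox": ["test-vm-1"],
            "pilot": ["pc-001", "pc-002"],
            "everyone": ["pc-003", "pc-004"],
        },
        apply_update=lambda host: fleet_state.__setitem__(host, "new"),
        is_healthy=lambda host: host != "pc-002",
        roll_back=lambda host: fleet_state.__setitem__(host, "old"),
    )
    print(outcome)
    print(fleet_state)
```

In this simulated run, the rollout stops at the pilot ring, both pilot machines are reverted, and the machines in the final ring never receive the faulty update. The design choice worth noting is that the blast radius of a bad update is limited to the smallest ring in which the problem shows up.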

If you’d like to discuss your cyber resilience challenges with an Acronis solutions engineer, book a call with us here.

About Acronis

A Swiss company founded in Singapore in 2003, Acronis has 15 offices worldwide and employees in 50+ countries. Acronis Cyber Protect Cloud is available in 26 languages in 150 countries and is used by over 20,000 service providers to protect over 750,000 businesses.