The world recently witnessed a Black Swan in cybersecurity: a global outage, caused by a bug in cybersecurity software, that led to the dreaded Blue Screen of Death (BSOD) on countless Microsoft Windows machines. This wasn’t your typical cyberattack, and understanding why it’s different is crucial for risk managers.
A Black Swan event, in Nassim Nicholas Taleb’s terms, is highly improbable, carries extreme impact, and is often explained in hindsight. The Microsoft-CrowdStrike event shares some Black Swan characteristics: it was unforeseen, and it had a widespread impact on businesses and critical infrastructure.
Whether it fully qualifies as a Black Swan is debatable. Here’s why:
Rarity: Such a large-scale compatibility issue between two major, widely trusted vendors is exceptionally rare. Both Microsoft and CrowdStrike are industry leaders known for their rigorous testing and quality assurance processes.
Severe Impact: The BSOD issue led to widespread system crashes, paralyzing critical business operations across various sectors. The incident’s impact was immediate and severe, causing financial losses, reputational damage, and operational chaos.
Predictability in Hindsight: While it is easy to identify the root cause after the fact, few could have predicted this precise issue. That a routine update to a cybersecurity tool could interact with Windows in a way that caused such a catastrophic failure was unforeseen.
Underlying Ingredients Present: The event involved common elements – software integration and bugs – unlike a truly unpredictable event.
Focus of Traditional Risk Models: Software glitches are not the typical focus of these models, but they are often covered, even if glitches originating in security tools are not.
While the Microsoft-CrowdStrike BSOD event may not fit the traditional definition of a Black Swan event as perfectly as other incidents, it shares several of the key characteristics. Its rarity, severe impact, and the challenge it presents to conventional risk modeling underscore the need for a more comprehensive approach to cybersecurity risk management.
By recognizing the potential for such events and adapting risk models accordingly, organizations can better prepare for and respond to unexpected and significant disruptions.
Unlike Black Swan events such as Stuxnet or the SolarWinds attack, the Microsoft-CrowdStrike BSOD event was not the result of malicious activity or sophisticated cyber espionage. Instead, it stemmed from a compatibility flaw between widely used software, showcasing a different facet of cybersecurity vulnerabilities: one rooted in software interactions rather than external threats.
Assumed Trust in Major Vendors: Risk managers typically trust that major vendors like Microsoft and CrowdStrike will ensure compatibility between their products. The assumption is that any potential issues would be identified and resolved during the vendors’ extensive testing phases.
Focus on External Threats: Most risk models prioritize external threats such as hacking, malware, and phishing attacks. Internal software compatibility issues, while acknowledged, are often deemed less likely and less critical.
Complexity of Dependency Modeling: The interdependencies between various software products are complex and challenging to model accurately. The sheer number of possible interactions makes it difficult to predict specific failure points.
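To give a rough feel for why dependency modeling is hard, the sketch below counts how many distinct pairs and triples of software products could interact in an estate of a given size. It only illustrates the combinatorics; the function name `interaction_counts` and the estate sizes are assumptions chosen for the example, and real interactions also depend on versions and configuration, which would make the numbers larger still.

```python
from math import comb


def interaction_counts(n_products: int) -> dict[str, int]:
    """Count distinct pairs and triples among n_products installed products."""
    return {
        "pairs": comb(n_products, 2),    # two-way interactions
        "triples": comb(n_products, 3),  # three-way interactions
    }


if __name__ == "__main__":
    # Hypothetical estate sizes, chosen only for illustration.
    for n in (10, 50, 200):
        counts = interaction_counts(n)
        print(f"{n} products: {counts['pairs']} pairs, {counts['triples']} triples")
```

Even a modest estate of 50 products has 1,225 possible pairs, which is why exhaustively predicting specific failure points is rarely practical.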
This event highlights the importance of a multilevel risk assessment approach.
So how can we prepare for the unpredictable?
Remember that traditional risk modelling is always ex-ante (done before the event), unlike incident response, which is ex-post (done after it).
Threat Label: Software Malfunction
Threat: Design error, installation error or operating error committed during modification causing incorrect execution.
Vulnerability: Possibility of incorrect configuration, installation or modification of the operating system
Impact: Complete / Major Outage to Business Operations
CIA Triad: Availability
Potential Controls: <Mentioned Later>
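If your risk register lives in code or is exported from a tool, the mapping above could be captured as a small record. This is a minimal sketch: the `ThreatEntry` class and its field names are assumptions made for illustration, not the schema of any particular GRC product, and the controls list is left empty here because the controls are discussed later.

```python
from dataclasses import dataclass, field


@dataclass
class ThreatEntry:
    """One threat-to-vulnerability mapping in a risk register (hypothetical schema)."""
    label: str
    threat: str
    vulnerability: str
    impact: str
    cia: list[str]                                      # affected parts of the CIA triad
    controls: list[str] = field(default_factory=list)   # filled in once controls are chosen


software_malfunction = ThreatEntry(
    label="Software Malfunction",
    threat=("Design error, installation error or operating error committed "
            "during modification causing incorrect execution."),
    vulnerability=("Possibility of incorrect configuration, installation or "
                   "modification of the operating system"),
    impact="Complete / Major Outage to Business Operations",
    cia=["Availability"],
)

if __name__ == "__main__":
    print(software_malfunction)
```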
Here are some ways to adapt your risk models:
1. Scenario Planning: Include risk scenarios in which security tools (or other software) malfunction or have unintended consequences.
2. Risk Register Updates: Add the threat and vulnerability mapping above to your risk register, if it is not already there, and identify the potential impact in your environment. Let us look at a worked example first.
Risk = Probability of Threat X Impact
Any risk manager modelling this would consider the probability that a major vendor’s OS is incompatible with another major vendor’s security agent, and that the combination was not thoroughly tested before release, to be very low. This is because:
IT teams apply OS vendor patches frequently, and things do break regularly, but what breaks is usually a custom business application, not the entire OS failing to boot.
Security software upgrades are usually set, as per policy, to auto-upgrade in order to get the latest and greatest signatures.
Worked Example:
Probability of Threat X Impact = 0.3 X 100 = 30 (Risk Score).
Even if there is a 30% chance of this specific threat occurring, i.e. a software incompatibility between two major vendors causing outages (which, in my humble personal opinion, is too high a probability for this particular threat), and the impact is rated 100/100, this risk will sit in the lower quadrants of any risk register. Read that as: hard to predict and hard to prioritize proactively.
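The arithmetic above can be sketched in a few lines. The `risk_score` function simply applies Risk = Probability of Threat X Impact on the same scales as the worked example (probability from 0 to 1, impact from 0 to 100). The `risk_band` thresholds are an assumption for illustration only; use whatever banding your own risk register defines.

```python
def risk_score(probability: float, impact: float) -> float:
    """Risk = probability of threat (0..1) x impact (0..100)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if not 0.0 <= impact <= 100.0:
        raise ValueError("impact must be between 0 and 100")
    return probability * impact


def risk_band(score: float) -> str:
    """Map a score to a band. These thresholds are hypothetical."""
    if score >= 75:
        return "critical"
    if score >= 50:
        return "high"
    if score >= 25:
        return "medium"
    return "low"


if __name__ == "__main__":
    # The worked example: a generous 30% probability with maximum impact.
    score = risk_score(0.3, 100)
    print(round(score, 2), risk_band(score))

    # A much lower, purely illustrative probability with the same impact.
    score = risk_score(0.01, 100)
    print(round(score, 2), risk_band(score))
```

With a lower probability estimate the score drops further, even though the impact stays at its maximum. This is how a purely multiplicative model lets high-impact, low-probability threats slip down the register.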
3. Vendor Risk Assessment Updates: If you send security questionnaires to vendors as part of Third-Party Risk Management (TPRM), ensure they include a question covering this threat, for example how the vendor tests its updates for compatibility before release.
Regular Compatibility Testing: Implement a routine for testing critical updates in a controlled environment before deploying them across the organization. This helps identify potential conflicts before they cause widespread issues. Test all patches and upgrades thoroughly in UAT/STG environments before rolling them out to production. Consider disabling the default auto-upgrade setting in environments with critical business impact.
Automated Testing Tools: Invest in automated tools that can simulate and test the deployment of updates across a virtualized environment, identifying potential issues without impacting live systems.
Redundancy and Backup: Maintain redundant systems and regular backups to ensure business continuity in case of critical failures.
User Training: Train IT staff and end-users on recognizing early signs of compatibility issues and reporting them promptly.
Incident Response Planning: Develop and regularly update incident response plans to address potential software compatibility issues. Ensure that there are clear protocols for quickly rolling back updates and restoring systems to a stable state.
BCP/DR Testing: Run regular drills to test Business Continuity and Disaster Recovery. In critical sectors like aviation and healthcare, one may consider having an alternative interim physical (non-digital) infrastructure for business continuity.
Virtual Desktops: Especially in this scenario, using dumb terminals and virtual desktops would have allowed a quick rollback to previous versions.
Expanded Vendor Risk Assessments: Evaluate the potential impact of software bugs during vendor assessments.
Software Diversity: Reduce reliance on a single security vendor by using a combination of tools from different providers. This can have other repercussions, such as IT teams needing to be trained on multiple tools (remember, there is already tool fatigue!).
Vendor Collaboration: Foster closer collaboration with major vendors to stay informed about upcoming changes and potential compatibility concerns. This can include participating in beta testing programs and staying active in vendor user communities.
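Several of the controls above (testing updates in a controlled environment first, disabling blanket auto-upgrade, and having a clear point at which to stop and roll back) can be combined into a staged, ring-based rollout. The sketch below is one possible shape for that idea, not any vendor's actual deployment API: `staged_rollout`, the ring and host names, and the `deploy` and `is_healthy` callables are all hypothetical stand-ins for whatever your patching tooling provides.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_ratio: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy ring by ring; stop as soon as a ring exceeds the failure budget.

    Returns the hosts that received the update and whether the rollout
    completed. Hosts in later rings are never touched after a halt.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            deploy(host)
            updated.append(host)
            if not is_healthy(host):
                failures += 1
        if ring and failures / len(ring) > max_failure_ratio:
            return updated, False  # halt here; roll back the hosts in `updated`
    return updated, True


if __name__ == "__main__":
    # Hypothetical rings: lab first, then a small pilot group, then production.
    rings = [["lab-1"], ["pilot-1", "pilot-2"], ["prod-1", "prod-2", "prod-3"]]

    # Simulate an update that crashes pilot machines.
    updated, completed = staged_rollout(
        rings,
        deploy=lambda host: None,
        is_healthy=lambda host: not host.startswith("pilot"),
    )
    print(updated, completed)
```

In this simulated run the rollout halts after the pilot ring, so the production hosts never receive the faulty update. The trade-off is that production gets new signatures later, which has to be weighed against the threat window.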
The Microsoft-CrowdStrike event serves as a wake-up call. While we can’t predict every Black Swan, we can adapt our risk models and security practices to be more resilient to the unexpected. By incorporating these suggestions, you can build a more robust security posture that can weather even the most surprising storms.