
Last Updated on July 29, 2024 by Arnav Sharma

On July 19, 2024, a seemingly routine update from CrowdStrike, a leading cybersecurity firm, led to a global incident that disrupted millions of Windows systems. This blog provides a comprehensive overview of the event, detailing what went wrong, how it happened, the immediate fallout, and the measures CrowdStrike is implementing to prevent such issues in the future.

The Incident

What Happened?

On the early morning of July 19, 2024, CrowdStrike deployed a Rapid Response Content configuration update designed to enhance telemetry and detect new threat techniques. Due to a bug in the Content Validator system, the update contained problematic content data that nevertheless passed validation. When the Falcon sensor loaded that data, it performed out-of-bounds memory reads, triggering the infamous blue screen of death (BSOD) on millions of Windows devices.

Scope and Impact

The faulty update affected approximately 8.5 million Windows machines running Falcon sensor version 7.11 and above. The issue caused significant disruptions across various sectors, including banking, airlines, and healthcare services, highlighting the dependency of critical infrastructure on reliable cybersecurity solutions.

Technical Details

The Root Cause

CrowdStrike’s Falcon platform uses two types of content updates: Sensor Content and Rapid Response Content. The incident was triggered by the latter. Specifically, a new IPC (Inter Process Communication) Template Type, introduced to detect malicious use of Named Pipes, was the source of the problem. On July 19, two new IPC Template Instances were deployed. Due to a bug in the Content Validator, one of these instances, which contained faulty data, passed validation and was deployed to the production environment.
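To make the validation gap concrete, the fragment below is a minimal, hypothetical sketch in C (not CrowdStrike’s actual code) of the kind of check a content validator can apply: reject any template instance whose content references more input fields than the sensor actually supplies at runtime, which is the general class of mismatch that leads to an out-of-bounds read. The structure names and the field counts (20 supplied, 21 referenced) are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdio.h>

#define SENSOR_SUPPLIED_FIELDS 20   /* hypothetical: fields the sensor provides at runtime */

typedef struct {
    const char *name;          /* instance identifier */
    int max_field_index_read;  /* highest input field the instance's content references */
} template_instance_t;

/* Reject any instance whose content references a field the sensor never supplies. */
static bool validate_instance(const template_instance_t *inst)
{
    if (inst->max_field_index_read > SENSOR_SUPPLIED_FIELDS) {
        fprintf(stderr, "reject %s: references field %d, sensor supplies only %d\n",
                inst->name, inst->max_field_index_read, SENSOR_SUPPLIED_FIELDS);
        return false;
    }
    return true;
}

int main(void)
{
    template_instance_t ok_instance     = { "ipc-instance-a", 20 };
    template_instance_t faulty_instance = { "ipc-instance-b", 21 };

    printf("ipc-instance-a: %s\n", validate_instance(&ok_instance) ? "pass" : "fail");
    printf("ipc-instance-b: %s\n", validate_instance(&faulty_instance) ? "pass" : "fail");
    return 0;
}
```

In a real pipeline a check like this would run against every instance before it reaches the content delivery system, so a faulty instance is stopped at validation rather than in the field.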

Timeline of Events

  1. February 28, 2024: Falcon sensor version 7.11 was released, introducing the new IPC Template Type.
  2. March 5, 2024: The IPC Template Type passed stress testing and was validated for use.
  3. April 2024: Several IPC Template Instances were deployed successfully.
  4. July 19, 2024 (04:09 UTC): Two additional IPC Template Instances were deployed.
  5. July 19, 2024 (05:27 UTC): The problematic update was identified and reverted.

The Technical Flaw

The error was in the Rapid Response Content update, specifically Channel File 291. This file, responsible for evaluating named pipe execution, contained problematic content that led to out-of-bounds memory reads, resulting in BSOD.
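To see why malformed content in a channel file can take down a whole machine, consider the simplified sketch below (hypothetical C, not the Falcon driver): an interpreter that indexes a fixed-size argument array with an index taken from content data. In user mode an out-of-range read is undefined behavior; in kernel mode it can touch an unmapped page and crash the system, which is why the checked variant rejects any index the runtime did not populate. All names and sizes here are assumptions.

```c
#include <stddef.h>
#include <stdio.h>

#define SUPPLIED_ARGS 20   /* hypothetical number of arguments the runtime populates */

/* Unsafe: trusts an index taken directly from content data.
 * An index of 20 reads one element past the array; in kernel mode that
 * read can hit an unmapped page and crash the machine. */
static const char *read_arg_unchecked(const char *args[], size_t idx)
{
    return args[idx];
}

/* Safe: rejects any index the runtime did not actually populate. */
static const char *read_arg_checked(const char *args[], size_t idx)
{
    if (idx >= SUPPLIED_ARGS)
        return NULL;
    return args[idx];
}

int main(void)
{
    const char *args[SUPPLIED_ARGS] = { "\\\\.\\pipe\\example" };
    size_t content_supplied_index = 20;   /* one past the last valid slot (19) */

    const char *value = read_arg_checked(args, content_supplied_index);
    printf("checked read: %s\n", value ? value : "(index rejected)");
    (void)read_arg_unchecked;             /* unsafe variant shown for contrast only */
    return 0;
}
```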

The Immediate Fallout

Global Disruption

The faulty update caused widespread outages, affecting millions of devices and leading to significant disruptions. Major organizations, including banks, supermarkets, airlines, and healthcare providers, faced operational challenges as their systems crashed.

Company Response

CrowdStrike quickly identified the issue and reverted the problematic update within 1.3 hours. However, the damage had already been done, with many systems around the world experiencing significant downtime.

Customer Support and Communication

CrowdStrike CEO George Kurtz issued a public apology, emphasizing the seriousness of the incident and reassuring customers that it was not the result of a cyberattack but an internal software error. The company mobilized its resources to help affected customers restore their systems.

Preventive Measures and Future Steps

Enhanced Testing and Validation

CrowdStrike is introducing several new measures to prevent such incidents in the future:

  • Improved Testing: The company is enhancing its testing protocols, including local developer testing, stress testing, stability testing, fuzzing, and fault injection (a minimal fuzzing harness is sketched after this list).
  • Additional Validation Checks: New validation checks are being added to the Content Validator for Rapid Response Content to catch problematic content before deployment.
  • Enhanced Error Handling: The Content Interpreter’s error handling capabilities are being improved so that unexpected or malformed content is managed gracefully instead of crashing the Windows operating system.
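For the fuzzing item above, a harness in the style of libFuzzer would look roughly like the sketch below. The parser and its file format are hypothetical stand-ins, not CrowdStrike’s code; the point is that every randomly mutated channel-file input must be rejected or handled safely, so memory-safety bugs surface under a sanitizer in testing rather than on customer machines.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real content parser; in practice the production
 * parsing code would be linked here instead. */
static int parse_channel_file(const uint8_t *data, size_t size)
{
    if (size < 4)
        return -1;                       /* too short to hold a header */
    uint32_t count = (uint32_t)data[0];  /* declared entry count */
    if ((size_t)count * 4 + 4 > size)
        return -1;                       /* declares more entries than are present */
    return 0;
}

/* libFuzzer entry point: a crash or sanitizer report on any input is a
 * failing test. Build with: clang -g -fsanitize=fuzzer,address harness.c */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    parse_channel_file(data, size);
    return 0;
}
```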

Deployment Strategies

  • Staggered Rollouts: Future updates will be deployed in a staggered manner, starting with a canary deployment to a small subset of systems before a full rollout; CrowdStrike says this approach is meant to minimize risk (see the sketch after this list).
  • Improved Monitoring: Enhanced monitoring of sensor and system performance during updates will guide a phased deployment approach.
  • Customer Control: Customers will have greater control over the delivery of Rapid Response Content updates, with new scheduling options to choose when and where updates are deployed.
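A staggered or canary rollout can be reduced to a simple gating rule, sketched below as hypothetical C (not a CrowdStrike API): each host is hashed into a stable bucket, and an update is delivered only to hosts whose bucket falls inside the current rollout percentage, so a small canary cohort receives it first and the percentage widens only if that cohort stays healthy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a hash: gives each host a stable bucket in 0..99. */
static uint32_t host_bucket(const char *host_id)
{
    uint32_t h = 2166136261u;
    for (const char *p = host_id; *p; ++p) {
        h ^= (uint8_t)*p;
        h *= 16777619u;
    }
    return h % 100;
}

/* A host receives the update only once the rollout percentage covers its bucket. */
static bool should_receive_update(const char *host_id, unsigned rollout_pct)
{
    return host_bucket(host_id) < rollout_pct;
}

int main(void)
{
    const char *hosts[] = { "host-a", "host-b", "host-c", "host-d" };
    unsigned stages[] = { 1, 10, 50, 100 };   /* canary, then progressively wider waves */

    for (size_t s = 0; s < sizeof stages / sizeof stages[0]; ++s) {
        printf("rollout %3u%%:", stages[s]);
        for (size_t i = 0; i < sizeof hosts / sizeof hosts[0]; ++i)
            if (should_receive_update(hosts[i], stages[s]))
                printf(" %s", hosts[i]);
        printf("\n");
    }
    return 0;
}
```

Because the bucket assignment is deterministic, a host never flips in and out of a wave; widening the percentage only ever adds hosts, which keeps phased monitoring results comparable across stages.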

Transparency and Accountability

CrowdStrike has committed to full transparency regarding the incident and is reviewing its internal protocols. The company has published a Preliminary Post Incident Review and is working on a detailed Root Cause Analysis, which will be released publicly once the investigation is complete.

Third-Party Validation

To ensure the robustness of its processes, CrowdStrike is conducting multiple independent third-party security code reviews and end-to-end quality process reviews.

As the investigation continues and more details emerge, the industry will undoubtedly gain valuable insights into improving update deployment and validation processes. For now, CrowdStrike’s efforts to address the root causes and implement preventive measures provide a framework for resilience and continuous improvement in the face of global tech challenges.


FAQ: 

Q: What company experienced an outage that affected airlines and other businesses?

CrowdStrike, a well-known cybersecurity firm, experienced an outage that impacted various businesses, including airlines. The incident drew significant attention because of the critical nature of the affected services and the scale of the resulting global tech outage.

Q: What went wrong during the CrowdStrike outage?

CrowdStrike identified a bug in one of their updates. This bug caused their cybersecurity systems to distribute incorrect data, or “bad data,” to millions of their customers’ computers, leading to widespread issues.

Q: What did CrowdStrike reveal in their preliminary post-incident review?

In their initial review after the incident, CrowdStrike disclosed that they are still gathering and analyzing information. They promised to provide more details as they complete their investigation.

Q: What caused the CrowdStrike systems to push bad data?

The root cause was a bug in CrowdStrike’s Content Validator that allowed a faulty Rapid Response Content update to pass validation and be distributed. The sensor could not handle the faulty data, which disrupted normal operations on affected machines.

Q: What impact did the outage have on businesses?

Many businesses that rely on CrowdStrike’s cybersecurity services experienced significant disruptions. The outage created a chaotic situation as these businesses struggled to keep operating while affected systems were down and their usual cybersecurity protections were unavailable.

Q: What is CrowdStrike doing to prevent this issue from happening again?

To prevent a similar incident in the future, CrowdStrike is implementing a new verification process (referred to as “a new check”) so that faulty content cannot again crash the Windows operating system. The company is also enhancing its internal testing procedures and introducing additional preventive measures to strengthen its systems.

Q: What is CrowdStrike’s plan for remediation?

CrowdStrike’s remediation plan involves launching new preventive strategies to avoid future issues and providing their customers with more control over their cybersecurity settings and updates.
