CrowdStrike, the cybersecurity company behind one of the biggest IT outages in history, if not the biggest, has explained how a bug led to a failure that affected 8.5 million Windows machines. The incident occurred after the company released an update to its anti-malware software, Falcon Sensor, on July 19. The update contained problematic content that caused machines to perform an ‘out-of-bounds memory read,’ resulting in critical boot failures and blue screens of death on the affected machines.
CrowdStrike’s Falcon Sensor uses ‘Sensor Content’ to define its capabilities. The sensor is then updated with ‘Rapid Response Content,’ which is designed to detect and collect information on new threats. Sensor Content relies on ‘Template Types,’ which are pieces of code with pre-defined fields that threat detection engineers fill in when writing Rapid Response Content. The new detection information is delivered in ‘Template Instances,’ which can change the software’s behavior and improve its detection, identification, and prevention capabilities.
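The relationship between a Template Type and a Template Instance can be pictured with a small Python sketch. This is only an illustration of the idea described above, not CrowdStrike’s actual data model: the class names, the field names (`pipe_name`, `process_name`), and the `unknown_fields` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TemplateType:
    """Hypothetical stand-in for a Template Type: a named set of pre-defined fields."""
    name: str
    fields: tuple[str, ...]


@dataclass
class TemplateInstance:
    """Hypothetical stand-in for a Template Instance: values filled into a type's fields."""
    template_type: TemplateType
    values: dict[str, str]

    def unknown_fields(self) -> list[str]:
        """Return any supplied field names that the Template Type does not define."""
        return [k for k in self.values if k not in self.template_type.fields]


# Field names below are invented for illustration only.
ipc_type = TemplateType(name="IPC", fields=("pipe_name", "process_name"))
instance = TemplateInstance(ipc_type, {"pipe_name": "example_pipe", "process_name": "*"})
print(instance.unknown_fields())
```

In this picture the Template Type ships with the sensor and fixes which fields exist, while each Template Instance is just data that fills those fields in, which is why an instance can change behavior without a new sensor release.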
In February 2024, CrowdStrike introduced the ‘InterProcessCommunication (IPC) Template Type.’ On March 5, after it passed testing, the first IPC Template Instance was released. Three additional IPC Template Instances were released between April 8 and April 24, followed by two more on July 19. One of the two July 19 instances included the content that caused the widespread outage.
CrowdStrike has stated that the IPC Template Instance containing the problematic content data wasn’t caught internally due to ‘a bug in the Content Validator.’ The company hasn’t elaborated on the nature of the bug or on what exactly the Content Validator does. As the name suggests, the software is presumably responsible for validating content and looking for faults and errors before release. In this case, it failed to identify the faulty data, which then reached customers’ machines.
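Since CrowdStrike has not described the Content Validator, the following Python sketch only illustrates the general kind of check such a tool might perform and why missing it matters. Everything here is an assumption: `read_field` and `validate` are hypothetical names, and Python raises an `IndexError` where native code would instead read memory past the end of a buffer.

```python
def read_field(values: list[str], index: int) -> str:
    """Simulate a consumer that reads a field by position with no bounds check of its own."""
    # In Python this raises IndexError; in native code the equivalent access
    # would be an out-of-bounds memory read.
    return values[index]


def validate(values: list[str], indexes_used: list[int]) -> list[str]:
    """Hypothetical validator: report every index that falls outside the supplied values."""
    return [
        f"field index {i} is out of range for {len(values)} supplied values"
        for i in indexes_used
        if not 0 <= i < len(values)
    ]


values = ["a", "b", "c"]
print(validate(values, [0, 2]))  # every index is in range, so nothing is reported
print(validate(values, [0, 3]))  # index 3 is past the end and is reported
```

If a check like this has a bug and lets the second case through, the failure only surfaces later, when the consumer tries to read the missing field.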
The outage impacted a wide range of organizations, including airlines, supermarkets, telecommunications companies, and emergency services, highlighting the severity of the bug and the consequences of such a widespread system failure. This incident serves as a reminder of the critical role that software testing plays in ensuring system stability and preventing potentially catastrophic outages.