It is unfortunate and ironic that CrowdStrike is meant to protect businesses from outages that can take down their IT systems in part or in whole, yet in this case it was a mistake by CrowdStrike itself that did the taking down. The incident was neither malicious nor intentional. It may well have been caused simply because we are all human: humans miss things, make mistakes, and cannot catch every possible outcome of editing a file.
Because of the resulting blue screen of death, it was not easy for CrowdStrike to get a fix out to all of the affected Windows servers, even once it had recognized that "Channel File 291" was at fault and corrected it.
In the rare instance that an update causes issues, CrowdStrike can normally issue a new update very quickly to overwrite the one causing problems. However, because the affected Windows servers had crashed, CrowdStrike was not able to reach them to deliver the fixed "Channel File 291". Intervention by the IT administrator of each Windows server was required to get the servers into a state where the issue could be addressed.
The affected Windows servers needed to be rebooted into Safe Mode, which is not straightforward to do remotely. Booting a Windows machine into Safe Mode is easy when you are physically sitting at it: you simply press F8 or SHIFT+F8 during startup. Even so, administrators who could quickly get to their machines were still out of service for the time it took to discover the issue, obtain the recommended fix from CrowdStrike, and apply it across one or many affected Windows servers.
Many administrators are not physically next to the servers they look after; some may be within driving distance, while others may need to get on a plane to reach them.
There are options for getting a Windows server into Safe Mode remotely, but they are complex and require that the IT administrator has the skills and familiarity with the tools involved. These include remote management tools, WinRE (the Windows Recovery Environment), RDP (Remote Desktop Protocol), PowerShell Remoting, cloud management platforms, and virtual machine consoles.
These tools are ideal when they work, but in many cases fixing a blue screen of death still requires physical access to the machine, especially when the machine is rendered incapable of responding to network connections or to those remote management tools.
Once a server was in Safe Mode, remediating the CrowdStrike issue required deleting the problematic "Channel File 291".
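For illustration only, here is a minimal Python sketch of that deletion step, assuming the default installation path and the C-00000291*.sys file naming pattern described in CrowdStrike's guidance. Administrators followed CrowdStrike's official remediation instructions; this is simply a way to picture what the fix amounted to.

```python
# Minimal sketch of the remediation's file-deletion step, run from Safe Mode.
# Assumes the default CrowdStrike driver directory and the C-00000291*.sys
# naming pattern from the vendor guidance; adjust the path for your install.
import glob
import os

CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"

def delete_channel_file_291(directory: str = CROWDSTRIKE_DIR) -> None:
    matches = glob.glob(os.path.join(directory, "C-00000291*.sys"))
    if not matches:
        print("No matching Channel File 291 files found.")
        return
    for path in matches:
        os.remove(path)
        print(f"Deleted {path}")

if __name__ == "__main__":
    delete_channel_file_291()
```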
What does "Channel File 291" do?
"Channel File 291" is generally located in C:\Windows\System32\drivers\CrowdStrike. It is used by the CrowdStrike Falcon sensor to detect and target newly observed malicious named pipes. When the sensor is working correctly, detecting malicious activity of this nature is critical, because named pipes are used for inter-process communication: the mechanism the Windows operating system uses to communicate and share data between applications. Getting something wrong in this area can cause catastrophic failure, as we saw in July of 2024.
What trouble did the CrowdStrike IT incident cause?
The incident affected Windows servers that were using CrowdStrike for cybersecurity protection. Windows servers are widely used in IT by many of the world's largest businesses. Mac and Linux operating systems are also used by many businesses, but neither Mac nor Linux had issues, and businesses not running Windows were completely unaffected.
Microsoft issued a blog post on July 20, 2024 stating that, while the issue was not caused by Microsoft, it affected 8.5 million Windows machines. Interestingly, Microsoft noted that this was less than 1% of all Windows machines. Not all Windows servers use CrowdStrike. CrowdStrike is a cybersecurity company that provides endpoint protection, threat intelligence, and incident response services. Its products are used by many organizations to protect their servers and endpoints, but there are many other cybersecurity solutions available from other vendors, including Symantec, McAfee, and Microsoft Defender.
There are plenty of opinions about CrowdStrike and Microsoft regarding this incident. Microsoft was extremely proactive in jumping in to help businesses such as major airlines get back online. It has been speculated that perhaps the Windows OS should not allow software running on it to put a Windows server into such a state. However, it is by partnering with other companies that do all sorts of very valuable things on the Windows platform that many of us enjoy the benefits we do today. This type of incident is very rare, and this particular one was the first of its kind to have the impact that it did.
Many airlines were affected, including Delta Air Lines, United Airlines, Alaska Airlines, Frontier Airlines and Southwest Airlines in the US. Other affected airlines included Allegiant Air, Sun Country Airlines (an ultra-low-cost US carrier), Spirit Airlines, Canadian-based Porter Airlines, AirAsia, Australia's Qantas, and Singapore Airlines.
In addition to airlines, which were most prominent in the news because of the number of people stranded in airports, hospitals, banks, government entities, manufacturers, retailers and many other types of businesses were affected. Windows servers are very common in IT, and any business running Windows servers with CrowdStrike that received the update on July 19, 2024 was affected.
What can we learn from the CrowdStrike July 2024 IT outage?
We have learned from cyberattacks that a single incident can have far-reaching effects. In this case, an incident had far-reaching effects even though it was not a cyberattack.
An example is the SolarWinds incident, which unfolded between 2019 and 2021: a supply chain attack in which malicious code affected some 30,000 organizations. At the time it was unexpected that a hack of SolarWinds could have such a large reach, but it ended up as a major event.
The Colonial Pipeline incident in May of 2021 was a ransomware attack that caused traffic jams and long lines at gas stations in the eastern US and led to panic among Americans trying to stockpile fuel.
Ultimately, an impact with the reach of the CrowdStrike IT outage, the Colonial Pipeline incident or the Kaseya VSA incident can be avoided with a lot of careful planning. We never expect the unthinkable to happen, and perhaps if you protect against it, you will never know for sure whether you really stopped something.
Here are a few policies that can help either stop or isolate these types of incidents if they do occur:
- Backup & Restore Services: Most businesses back up their critical systems so that if an outage occurs, a restore can be performed as soon as possible with as little data lost as possible. For example, if a system were restored after the CrowdStrike IT incident, any data created between the last backup and the time of the incident, such as flight bookings, would be lost. Frequent backups keep data loss to a minimum and enable a business to restore its systems to exactly how they were at a specified time in the past.
- OS redundancy: A newer lesson is that Windows servers were affected by the CrowdStrike incident while macOS and Linux systems were not. Perhaps redundancy needs to mean High Availability systems running on different operating systems; today, both halves of a High Availability pair usually run the same underlying OS. A High Availability system typically consists of two or more servers that stay in sync: if one server goes down, another detects the failure and takes over with all of its data fully up to date (see the failover sketch after this list). In the case of the CrowdStrike IT outage, a High Availability pair with one Windows server and one Linux server could have avoided the impact of the outage. Of course, many businesses have many servers, and setting up High Availability for every one of them is expensive, time consuming and a lot to manage. Having said that, the cost of doing so is likely much less than the impact of an outage the size of the CrowdStrike IT outage.
- Be cautious of updates: A fine balance is needed with respect to IT updates. It is important to update systems, because an update can close off a potential way for a hacker to get in with a cyberattack. However, trusting the vendor to have properly tested its update has just been shown to be equally important. Rolling updates out over a period of time can help: updating just one server, then five, then larger batches can show that it is safe to go global. Many enterprises apply updates to a sandbox system first and test them before allowing the update into production for critical systems.
- Isolate and Segregate: Separating as many IT operations and applications on a network as possible ensures that if an incident happens in one part of the network, it only affects the applications running in that part; it cannot traverse into other parts of the network that are not connected to it. This would not have helped in the CrowdStrike case, but it can help in many other cases where something gets onto the network and moves around, infecting other parts of it.
- Use Zero Trust principles: Zero Trust is the concept of only allowing access to systems, applications, and even features within those systems and applications, to the specific people and systems that need it. It starts by allowing no one and nothing any access, and then adds access one by one as it is needed. This is very much a best practice, but it would not have helped with the CrowdStrike incident. It could help in many other cases where a network could be compromised.
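To illustrate the High Availability failover idea from the OS redundancy item above, here is a minimal Python sketch. The hostnames, port and promotion step are hypothetical, and a real deployment would rely on clustering or load-balancing products rather than a script like this.

```python
# Minimal health-check-and-failover loop: if the Windows primary stops
# answering, promote a standby running a different operating system.
# Hostnames, port and the promotion mechanism are hypothetical.
import socket
import time

PRIMARY = ("windows-primary.example.local", 443)  # hypothetical Windows node
STANDBY = ("linux-standby.example.local", 443)    # hypothetical Linux node

def is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Treat a successful TCP connection as a passing health check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(poll_seconds: int = 5) -> None:
    while True:
        if not is_healthy(*PRIMARY):
            print("Primary unreachable; promoting the standby on a different OS.")
            # In practice: repoint DNS or a virtual IP at STANDBY here.
            break
        time.sleep(poll_seconds)

if __name__ == "__main__":
    monitor()
```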
With respect to the above list, it is CrowdStrike itself that controls how many servers an update is rolled out to once it is ready. To its credit, CrowdStrike does follow a phased rollout policy. However, in the case of the bad "Channel File 291", we have learned that while rollouts of CrowdStrike's sensor software are subject to phased rollout policies, this does not extend to what it calls "Rapid Response Content" updates. Part of ensuring that CrowdStrike can respond to zero-day cyberattacks is being able to push out these Rapid Response Content updates very quickly. They happen faster and more frequently than other types of updates in order to keep up with threats and protect against them as soon as they are discovered.
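To show what a phased rollout looks like in practice, here is a minimal Python sketch of the "one, then five, then batches" approach mentioned in the update-caution item above. The deploy and healthy functions, batch sizes and soak time are hypothetical placeholders, not a description of CrowdStrike's actual process.

```python
# Staged (phased) rollout sketch: deploy to one server, then five, then
# progressively larger batches, halting if any batch reports problems.
# deploy() and healthy() are hypothetical callables supplied by the caller.
import time
from typing import Callable, List

def staged_rollout(
    servers: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    batch_sizes=(1, 5, 25, 100),
    soak_seconds: int = 600,
) -> bool:
    remaining = list(servers)
    for size in batch_sizes:
        if not remaining:
            break
        batch, remaining = remaining[:size], remaining[size:]
        for server in batch:
            deploy(server)
        time.sleep(soak_seconds)  # let the batch "soak" before judging it
        if not all(healthy(server) for server in batch):
            print("Rollout halted: a server in the current batch is unhealthy.")
            return False
    # The rest of the fleet is only updated once every batch stayed healthy.
    for server in remaining:
        deploy(server)
    return True
```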
Summary
Usually we write about the impact of a cyberattack: what happened, why it happened, and some thoughts on how to protect against a similar incident in the future. This incident, ultimately attributed to a faulty file pushed out by the very provider whose job is to protect systems from outages caused by hackers, is a completely different concept.
One way to protect against this kind of incident in the future is for all vendors to roll out upgrades very cautiously, and most do. Many vendors use their own products and roll out upgrades to demo systems, test systems, their own production systems, and then a very small select list of customers. They then wait until they are sure there are no ill effects before doing a global rollout.
Even more detail on what happened
CrowdStrike's Falcon security platform includes sensors, deployed as part of the security installation, that collect telemetry and look for potential threats. These sensors use "Rapid Response Content" to perform behavioral pattern matching: the sensor can look for specific behaviors and then observe, detect or prevent them. A Channel File such as "Channel File 291" is read by the sensor and, combined with the configuration the business has set, determines whether a given type of behavior is observed, detected or prevented. A Content Interpreter reads the Channel Files for the sensor and should handle exceptions raised by anything problematic it reads. In the case of this IT outage, content in "Channel File 291" resulted in an out-of-bounds memory read that caused an exception. Exceptions should be caught and handled gracefully, but this one was not, and it ended up causing a system crash and the blue screen.
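As a simplified illustration of this failure mode (not CrowdStrike's actual code, which runs as a kernel-level driver written in a systems language), the following Python sketch shows an interpreter reading entries from a channel file and indexing past the end of a fixed-size table. Whether the resulting exception is caught determines whether the interpreter skips the bad entry or crashes; every name and structure here is hypothetical.

```python
# Hypothetical content interpreter: each entry names an action by index into a
# fixed-size table. An out-of-bounds index raises IndexError; catching it lets
# the interpreter skip the bad entry, while letting it escape would crash the
# process (and, in kernel-mode code, the whole operating system).
HANDLER_TABLE = ["observe", "detect", "prevent"]  # hypothetical action table

def interpret_entry(entry: dict) -> str:
    return HANDLER_TABLE[entry["action_index"]]  # may raise IndexError/KeyError

def interpret_channel_file(entries: list) -> None:
    for entry in entries:
        try:
            action = interpret_entry(entry)
            print(f"rule {entry.get('rule_id')}: {action}")
        except (IndexError, KeyError) as exc:
            # Graceful handling: log and skip the malformed entry, keep running.
            print(f"skipping malformed entry {entry!r}: {exc}")

# One well-formed entry followed by one whose index exceeds the table bounds.
interpret_channel_file([
    {"rule_id": 1, "action_index": 2},
    {"rule_id": 2, "action_index": 291},
])
```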
- CrowdStrike's Remediation and Guidance Hub: Falcon Content Update for Windows Hosts
- Microsoft: Helping our customers through the CrowdStrike outage