When computer screens around the world turned blue on Friday, flights were grounded, hotel check-ins became impossible, and deliveries were halted. Businesses resorted to pen and paper. Initial suspicions fell on some kind of cyberterrorist attack. The reality was simpler: a botched software update from the cybersecurity firm CrowdStrike.
“In this case, it was a content update,” said Nick Hyatt, director of threat intelligence at security firm Blackpoint Cyber.
Since CrowdStrike has a broad customer base, the content update was felt around the world.
“One mistake has had catastrophic results,” Hyatt said. “This is a great example of how closely tied our modern society is to information technology – from coffee shops to hospitals to airports, a mistake like this has devastating consequences.”
In this case, the content update was tied to the CrowdStrike Falcon monitoring software. Hyatt said Falcon has deep hooks into endpoints – in this case, laptops, desktops, and servers – to monitor for malware and other malicious behavior. Falcon updates itself automatically to address new threats.
“The faulty code was introduced through the auto-update feature, and here we are,” Hyatt said. Auto-update capability is standard in many software applications and not unique to CrowdStrike. “Because of what CrowdStrike does, the consequences here are catastrophic,” Hyatt added.
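The pattern Hyatt describes can be sketched roughly as follows: an endpoint agent that periodically polls for new threat definitions and applies them without user action. This is an illustration only, not CrowdStrike Falcon's actual code; the interval, version strings, and function names are assumptions made for the sketch.

```python
import time

# Illustrative sketch of an automatic content-update loop on an endpoint agent.
# The interval, versions, and helpers are hypothetical, not Falcon's real mechanism.

CHECK_INTERVAL_SECONDS = 3600  # e.g., poll the vendor for new content every hour


def fetch_latest_content_version() -> str:
    """Stand-in for asking the vendor's update service for the newest
    threat-definition ("content") version."""
    return "content-291"  # simulated response for the sketch


def apply_content_update(version: str) -> None:
    """Stand-in for installing the new definitions into the running agent."""
    print(f"Applied content update {version}")


def auto_update_loop(current_version: str, iterations: int = 3) -> None:
    for _ in range(iterations):  # a real agent would loop indefinitely
        latest = fetch_latest_content_version()
        if latest != current_version:
            apply_content_update(latest)
            current_version = latest
        time.sleep(0.1)  # shortened from CHECK_INTERVAL_SECONDS for the sketch


if __name__ == "__main__":
    auto_update_loop(current_version="content-290")
```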
Blue Screen of Death errors appear on computer screens in Ankara, Turkey, on July 19, 2024, during the global outage caused by a faulty update from cybersecurity provider CrowdStrike that crippled Microsoft Windows systems.
Harun Ozalp | Anadolu Agency | Getty Images
Although CrowdStrike was able to quickly identify the issue, and many systems were back online within hours, the global chain of damage is not easily reversible for organizations with complex systems.
“I think it will take three to five days before things are resolved,” said Eric O’Neill, a former FBI counterterrorism and counterintelligence agent and cybersecurity expert. “That’s a lot of downtime for organizations.”
It didn't help that the outage happened on a summer Friday, when many offices were empty and IT staff were in short supply to help fix the problem, O'Neill said.
Rolling out software updates gradually
One lesson learned from the global IT outage, O'Neill said, is that the CrowdStrike update should have been rolled out gradually.
“What CrowdStrike was doing was pushing its updates out to everyone at once. That’s not the best idea. Send it to one group and test it first. There are levels of quality control it should go through,” O’Neill said.
“The software should have been tested in protected environments, and in multiple environments, before it was released,” said Peter Avery, vice president of security and compliance at Visual Edge IT.
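As a rough illustration of the staged rollout O’Neill describes and the multi-environment testing Avery calls for, the sketch below pushes an update to progressively larger rings of machines and halts if health telemetry dips. The ring sizes, health threshold, and helper functions are hypothetical assumptions, not any vendor’s actual release pipeline.

```python
import random
import time

# Illustrative staged ("canary") rollout: deploy to small rings first,
# watch fleet health, and stop the rollout if the signal degrades.

ROLLOUT_RINGS = [
    ("internal test fleet", 0.001),   # tiny internal population first
    ("early-adopter ring", 0.01),     # small customer cohort
    ("broad ring", 0.25),             # wider slice of the fleet
    ("general availability", 1.0),    # everyone else
]

HEALTH_THRESHOLD = 0.999  # abort if fewer than 99.9% of updated hosts stay healthy


def check_fleet_health(ring_name: str) -> float:
    """Stand-in for real telemetry: fraction of updated hosts in this ring
    that are still booting and reporting in."""
    return random.uniform(0.995, 1.0)  # simulated signal for the sketch


def deploy_to_ring(ring_name: str, fraction: float) -> None:
    """Stand-in for pushing the update to a fraction of endpoints."""
    print(f"Deploying update to {ring_name} ({fraction:.1%} of fleet)")


def staged_rollout() -> bool:
    for ring_name, fraction in ROLLOUT_RINGS:
        deploy_to_ring(ring_name, fraction)
        time.sleep(0.1)  # in practice: hours or days of soak time per ring
        health = check_fleet_health(ring_name)
        if health < HEALTH_THRESHOLD:
            print(f"Health {health:.4f} below threshold in {ring_name}; halting rollout")
            return False
        print(f"{ring_name} healthy ({health:.4f}); proceeding")
    return True


if __name__ == "__main__":
    staged_rollout()
```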
Further safeguards will likely be needed to prevent this type of failure from happening again.
“You need the right checks and balances in companies. Maybe one person decided to push this update, or maybe someone picked the wrong file to implement it,” Avery said.
The IT industry calls this a single point of failure: a fault in one part of a system that cascades into a technological disaster across industries, functions, and interconnected communications networks, like a massive domino effect.
Call for Building Redundancy into IT Systems
Friday's event could prompt companies and individuals to raise their cyber preparedness.
“The bigger picture is how fragile the world is; it’s not just an electronics or technology problem. There are a number of different phenomena that can cause outages, like solar flares that can destroy our communications and electronics,” Avery said.
Ultimately, Friday’s outage was not an indictment of CrowdStrike or Microsoft, but of how businesses view cybersecurity, said Jawad Abed, an associate professor of information systems at Johns Hopkins University’s Carey School of Business. “Business owners need to stop looking at cybersecurity services as just a cost and instead look at them as a fundamental investment in the future of their company,” Abed said.
Companies should do this by building redundancy into their systems.
“A single point of failure should not stop a business, and that’s what happened,” Abed said. “You can’t rely on one tool for cybersecurity, and that’s the basics of cybersecurity.”
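One minimal sketch of what that redundancy can look like in practice: the business treats coverage as intact as long as at least one independent control is still healthy, so the failure of a single tool degrades protection rather than halting operations. The control names and health flags below are illustrative assumptions, not a real product integration.

```python
from dataclasses import dataclass

# Illustrative redundancy check: no single tool's failure should leave the
# business blind or stopped.


@dataclass
class Control:
    name: str
    healthy: bool


def effective_coverage(controls: list[Control]) -> bool:
    """The business stays covered if at least one independent control is working."""
    return any(c.healthy for c in controls)


if __name__ == "__main__":
    endpoint_controls = [
        Control("primary EDR agent", healthy=False),   # e.g., knocked out by a bad update
        Control("built-in OS protection", healthy=True),
    ]
    if effective_coverage(endpoint_controls):
        print("Degraded but covered: fail over to the backup control and keep operating")
    else:
        print("Single point of failure hit: endpoints are unprotected")
```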
While building redundancy into enterprise systems is expensive, the cost of what happened on Friday is far higher.
“I hope this serves as a wake-up call, and I hope it leads to some changes in the mindset of business owners and organizations to review their cybersecurity strategies,” Abed said.
What to do about “kernel-level” code
At the macro level, it’s fair to place some blame on a system in the enterprise IT world that often views cybersecurity, data security, and the technology supply chain as “nice-to-haves” rather than essentials, as well as a general lack of cybersecurity leadership within organizations, said Nicholas Rees, a former Department of Homeland Security official and a professor at New York University’s SPS Center for Global Affairs.
At a micro level, Rees said, the code that caused the disruption was kernel-level code, which touches every aspect of a computer’s communication between hardware and software. “Kernel-level code should be subject to the highest level of scrutiny,” Rees said, with approval and execution kept as entirely separate processes, with accountability.
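One way to picture the separation Rees describes is a deployment gate in which approval and execution are distinct, recorded steps carried out by different people. The sketch below is illustrative only; the roles, change IDs, and checks are assumptions, not any vendor’s actual process.

```python
from dataclasses import dataclass

# Illustrative gate: a kernel-level change cannot be deployed without an
# independent, recorded approval, and the approver cannot also be the operator.


@dataclass(frozen=True)
class Approval:
    change_id: str
    approver: str
    reviewed_in_staging: bool


def approve_change(change_id: str, approver: str, staging_passed: bool) -> Approval:
    """Approval step: performed by a reviewer and recorded for accountability."""
    if not staging_passed:
        raise ValueError(f"{change_id}: cannot approve, staging tests did not pass")
    return Approval(change_id, approver, reviewed_in_staging=True)


def execute_deployment(change_id: str, operator: str, approval: Approval) -> None:
    """Execution step: a different person deploys, and only with a valid approval."""
    if approval.change_id != change_id:
        raise PermissionError("approval does not match this change")
    if operator == approval.approver:
        raise PermissionError("approver and operator must be different people")
    print(f"{operator} deploying kernel-level change {change_id} "
          f"(approved by {approval.approver})")


if __name__ == "__main__":
    ok = approve_change("sensor-content-291", approver="alice", staging_passed=True)
    execute_deployment("sensor-content-291", operator="bob", approval=ok)
```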
It's a problem that will persist across the entire ecosystem, which is filled with third-party vendor products, all of which have vulnerabilities.
“How do we look across the third-party vendor ecosystem and see where the next vulnerability is going to come from? It’s almost impossible, but we have to try,” Rees said. “With the number of potential vulnerabilities out there, another failure is not a possibility, it’s a certainty. We need to focus on and invest in backup and redundancy, but companies say they can’t afford to pay for things that may never happen. It’s a hard argument to make,” he said.