
    Computing chaos: When CrowdStrike sneezes, the entire business world catches a cold

    By Jack Baruth,

    16 hours ago


    On July 18, the average person had no idea what “CrowdStrike” was. That changed in a hurry when a minor programming error in a mandatory, and automatically distributed, update to the company’s “Falcon” program disabled much of the Western world’s computing infrastructure. It grounded flights, disabled 911 response centers in at least three states, and caused major interruptions for businesses as diverse as fast-food restaurants and Formula One teams. This sort of nation-spanning simultaneous failure would have been utterly impossible as recently as a decade ago, but it’s highly likely to happen with increasing frequency from now on, thanks to a combination of individually benign, but collectively deadly, changes in our global technology infrastructure.

    The first of these problems, and the proximate cause behind the CrowdStrike outage, has to do with the way software is written in 2024. Historically, computer programs were the work of small, dedicated teams that understood their products from nose to tail. Often, a single person did the bulk of the work, as was the case for both the popular 1982 home video game River Raid, written by Activision employee Carol Shaw, and the powerful UNIX operating system, initially created by AT&T’s Ken Thompson. Software written in this fashion tended to be effective, efficient, and largely bug-free, which was important in an era without the possibility of remote software updates. It was also remarkably difficult to predict when it might be finished. The average pre-internet computer programming project was kind of like the later Steely Dan records: just a few enigmatic people running the show, with no accountability to management and little incentive to follow anything other than their own whims along the way.

    It’s not like that anymore. Most of today’s software is developed and released in two-week “sprint” intervals by teams of anonymous and interchangeable hired-gun, low-skill programmers, most of whom are sourced from overseas on a lowest-bidder basis. Each of them is given a tiny piece of the overall task on which to work. Rarely do they possess or even want a greater understanding of how their contributions fit in with the program as a whole. When there is a conflict between the work of two adjacent coders, it is resolved in automatic fashion by the tools with which they work, and not always correctly.

    This creates a culture in which offshore and H-1B programmers are considered to be expendable commodities but their onshore managers are irreplaceable assets who portray themselves as masters of “agile” or “scrum” methods to anonymize and dehumanize the people doing the actual work. Consequently, this arrangement is all but irresistible to American tech leaders, even if the promised cost savings from offshore code farms never materialize and even if the resulting product is subpar. Which it almost always is nowadays, in ways ranging from “this new phone is slower than my old one, even though it’s more powerful,” to “this airplane seems to fall out of the sky more often than we’d like.”

    Of course, even the most incompetent software can’t hurt you if it isn’t installed on your computer, or if you have a chance to evaluate it on test systems before installing it. In the past, most major systems were operated by skilled personnel who had the last word on what went on “their” computers. It was common to test software patches or updates on a few systems before releasing them to the company as a whole. This didn’t happen with the CrowdStrike update because the Falcon program, which is supposed to protect computers against criminal hacking and external attacks, has authority that supersedes that of the system administrators. It could install its own updates from CrowdStrike at any time, without the consent of the computer owner. Which it did, pretty much everywhere all at once. Then the dominoes started to fall.

    This “absolute power” is a nonnegotiable part of using the CrowdStrike software. Clients are not allowed to place their own controls or cautions in the process, which places them at the absolute mercy of a company that was clearly willing to install an unproven and outrageously harmful update remotely on their servers with absolutely zero notice. Yet most of them would still have been safe from this combination of carelessness (on CrowdStrike’s part) and helplessness (on theirs) had CrowdStrike followed even the most basic of safety policies during its update process.

    It only took the firm a few hours to understand the problem and to provide a fix. Had CrowdStrike deployed this noncritical update on a staggered basis, as has traditionally been done across the industry, it would likely have fixed the problem before a major percentage of its customers were affected by it. Instead, it sent it to everyone at the same time. There’s little explanation for doing so other than the culture of “we know what’s best for you” tech company arrogance that manifests itself everywhere from the lack of a “back” button on iPhones to the general belief that a program for which you’ve discontinued the annual license, such as McAfee Antivirus, has the inherent right to “pop up” on your screen perpetually and demand additional payment like the electronic equivalent of a Barbary pirate.
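
    For readers who want to see what a staggered deployment looks like in practice, here is a minimal sketch in Python. It is not CrowdStrike’s process or anyone’s real tooling: the function name, the wave sizes, the failure threshold, and the apply_update callback are all assumptions made up for illustration. The idea is simply to push an update to a small slice of machines first and stop if too many of them fail.

```python
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Iterable[float],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> List[str]:
    """Apply an update in growing waves and halt when a wave fails too often.

    All names here are hypothetical. `stage_fractions` are cumulative shares
    of the fleet (for example 0.01, 0.1, 0.5, 1.0), and `apply_update`
    returns True when a host is healthy after the update.
    Returns the hosts that were updated successfully before any halt.
    """
    updated: List[str] = []
    start = 0
    for fraction in stage_fractions:
        # Cumulative number of hosts that should have been attempted
        # once this wave is done (always at least one new host).
        end = min(len(hosts), max(start + 1, int(len(hosts) * fraction)))
        wave = hosts[start:end]
        if not wave:
            break  # every host has already been attempted
        failures = 0
        for host in wave:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        if failures / len(wave) > max_failure_rate:
            # Halt: the remaining hosts never receive the bad update.
            break
        start = end
    return updated
```

    Under these assumptions, a fleet of 1,000 machines updated in waves of 1%, 10%, 50%, and 100% would see an update that breaks every machine stopped after the first 10, with the other 990 left untouched.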

    Yet even the above-described combination of carelessness, absolute power over consumers, and staggering corporate narcissism is nothing new to American consumers. It’s why the General Motors “X cars” had to rack up more than a dozen fatalities before they were recalled for a rear-brake fix. That famous problem, however, affected just a fraction of the cars sold in showrooms at the time. Ford Fairmont buyers didn’t have to worry. (The same was true for buyers of the Chevrolet Vega when the Ford Pinto was recalled for fuel system-related fires.) The automotive business is inherently competitive. There are plenty of different manufacturers who would like to provide your next car.

    This was true of computing for a long time, as well. At the turn of the century, there were a dozen different vendors for server operating systems and multiple providers for almost every imaginable type of software. This diversity of environment has declined at a Brazilian rainforest pace over the last two decades. The vast majority of servers are now either Microsoft Windows, which was affected by the outage, or a few different flavors of Linux, which were not.

    We are now dangerously close to a “monoculture” in many aspects of tech. The vast majority of cloud servers are run by Amazon, so when an outage strikes, as it did in the “US-East” region of Amazon Web Services on Dec. 7, 2021, the effects are immediate and far-reaching. The combination of Windows Server and CrowdStrike Falcon is common at more than half of the Fortune 500 companies, so when CrowdStrike sneezes, the whole business world catches a cold.

    The greatest irony is that these single points of failure are often the direct result of policies that are meant to increase the stability and availability of services. Our modern computing dogma of “site reliability engineering” demands the highest possible number of absolutely identical servers and software builds. This supposedly makes maintenance and upkeep easier. In practice, it tends to mean the entire infrastructure depends on one piece of software, and that one piece of software often has absolutely disproportionate power to knock everything down.

    How did we get to the monoculture? Some of you may remember the old phrase “Nobody ever got fired for buying IBM.” The famously anticompetitive tech sector has used a series of technical partnerships and deliberate incompatibilities to extend this mindset to nearly every level of software and computing. CrowdStrike is an Amazon Web Services partner, a Dell partner, a Netskope partner, and so on. When you buy one product in the stack, you’re encouraged to buy the other products as well — so most tech leaders simply do the easiest thing.

    Often, this means abandoning common sense altogether. The Okta platform, for example, puts all of your company’s authentication in the hands of a third party, while CyberArk will gladly store all of your passwords. It’s perfectly ordinary nowadays for a Fortune 500 company to hand all of its passwords, privileges, and authentications to a third party while at the same time employing a Byzantine labyrinth of policies and procedures to restrict the privileges of its own tech support and system administration staff. When these third-party authentication and password providers are compromised, they are often unwilling or reluctant to disclose their problems to the very customers they are supposed to protect, as was the case with both Okta and password “vault” LastPass in 2022. What did most LastPass customers do when they were betrayed? Most of them just moved en masse to Keeper or 1Password. This is like giving your wallet to a random person on the subway, watching them run away with it, and concluding that your mistake was giving your wallet to the wrong person.


    In light of the above, the only surprising thing about the CrowdStrike problem was that it took so long to happen at this scale. It will almost certainly happen again, with another monoculture “choke point,” and it will keep happening until corporations learn the correct lessons as a result. Doing software development in the United States, with your own employees, solves a lot of the problems. Refusing to work with software providers that insist on taking control of your systems will handle most of what’s left. A little bit of diversity focus wouldn’t go amiss. In this case, we’re talking “diversity of computing infrastructure.”

    Yes, this outage was CrowdStrike’s fault. That’s like saying that the Challenger disaster was an O-ring problem. It doesn’t convey the broken nature of the system that let it happen. In this case, the lessons should be clear to every tech leader in America. Most of them won’t bother to learn those lessons or even take the smallest steps to prevent the next problem. After all, this outage is now handled. It’s history. There’s just one little problem: It’s the kind of history that is all but certain to repeat, again and again.

    Jack Baruth was born in Brooklyn, New York, and lives in Ohio. He is a pro-am race car driver and a former columnist for Road and Track and Hagerty magazines who writes the Avoidable Contact Forever newsletter.
