I first want to congratulate Crowdstrike for their efforts to completely offline as many Windows computers as possible in order to improve security. They successfully managed to identify the problem (working computers) and come out with a solution.
Honestly, I don’t think Crowdstrike is the problem here. They are just one of many companies with the ability to disrupt just about everything. I think the real issue is handing control of huge numbers of machines and companies to a handful of vendors. We have moved to cloud management and cloud solutions, and when those solutions fail or have issues, the world comes apart. Imagine if instead of Crowdstrike it had been Ninjaone or Zscaler. The amount of trust we put in these solutions is very high and, in my opinion, unwarranted. Crowdstrike just happened to be the one to fail by chance.
I think we need locally run solutions, or at the very least a diverse market of cloud solutions. With locally run and controlled solutions, a company and its admins have far more control and can keep the entire environment from blowing up.
I also think FOSS is very good in the area of control and independence. It is entirely possible for a FOSS solution to ship a bad update. The difference is that FOSS doesn’t push updates to millions of machines; the responsibility lies with the company’s admins, and they stay in control.
I realize that cloud solutions are very convenient and useful. It is much easier to pay a company like Crowdstrike to make your job easier. However, when doing disaster planning you need to be aware of the risk posed by outside vendors. I realize updates are important, but you shouldn’t allow devices to pull them blindly. Any solution that does this by design is flawed.
Another aspect is the level of access. I think it is important to practice least privilege and to limit potential damage. It is funny to me that the solution to security is to grant a single program complete access on all machines and then allow the vendor to update it. This seems like an implosion waiting to happen. I know it is sometimes hard to avoid, but there need to be several layers to prevent a train wreck. Companies should test updates and then be able to deploy them at their leisure.
On Linux there are immutable systems that have automatic rollback. Additionally, embedded Linux systems usually use an A/B partition scheme: if one slot fails to boot, the other one is used, and updates simply flash a new image to the unused slot and then reboot with it set to primary. I do not know if there is a way to do this on Windows, but it would be nice to have some sort of recovery, even if it is in the form of a custom driver that counts boot failures and then acts accordingly. It was such a bad idea to use BitLocker without any plan for how to recover. Organizations basically ransomwared themselves.
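Roughly, the slot-selection logic looks something like this. This is a minimal sketch, not tied to any real bootloader; the state-file path, slot names, and failure threshold are made up for illustration:

```python
import json
from pathlib import Path

STATE_FILE = Path("/var/boot/slot_state.json")  # hypothetical location
MAX_FAILURES = 3  # give the active slot a few tries before falling back


def choose_boot_slot() -> str:
    """Pick the A or B slot, falling back if the active one keeps failing to boot."""
    state = json.loads(STATE_FILE.read_text())
    active, fallback = state["active"], state["fallback"]

    # The counter was incremented on each previous boot attempt; a successful
    # boot is expected to reset it to zero (see mark_boot_successful below).
    if state["failed_boots"] >= MAX_FAILURES:
        # Active slot is not coming up: swap roles and boot the known-good image.
        state["active"], state["fallback"] = fallback, active
        state["failed_boots"] = 0

    state["failed_boots"] += 1  # cleared later by a "boot succeeded" service
    STATE_FILE.write_text(json.dumps(state))
    return state["active"]


def mark_boot_successful() -> None:
    """Run late in startup, once the system is known to be healthy."""
    state = json.loads(STATE_FILE.read_text())
    state["failed_boots"] = 0
    STATE_FILE.write_text(json.dumps(state))
```

Real implementations (GRUB’s boot-success flag, U-Boot’s bootcount, Android’s A/B slots) do essentially this inside the bootloader itself; the point is that the fallback decision happens before the broken image gets another chance to wedge the machine.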
The idea that one vendor can update 10+ million computers in critical infrastructure (airlines, hospitals, etc.) is bad. This is a single point of failure.
We now live in a world where so many are dependent on so few. As the better applications rise to the top, more organizations adopt them, which means a single update can disrupt an entire class of systems. Mitigation is the key. Enterprises with thousands of systems may have mitigation processes for viruses and intrusions, but a scenario like this one, where an application with authority to run in the kernel has its vendor effectively bypass the “signed” kernel driver by pushing “p-code” configuration files, was not a scenario anyone thought to mitigate, because the vendor was trusted and had to achieve high certifications.
Continuously running systems have many techniques for snapshot/rollback releases, and in the VM world this can be achieved with many approaches. But on bare iron it’s a bit more difficult.
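In the VM case, the snapshot-before-update pattern is straightforward. Here is a minimal sketch driving libvirt’s virsh CLI from Python; the domain and snapshot names are invented for the example, and a real setup would check guest health rather than just command exit status:

```python
import subprocess
import sys


def run(cmd: list[str]) -> None:
    """Run a command, raising if it exits non-zero."""
    subprocess.run(cmd, check=True)


def update_with_rollback(domain: str, apply_update) -> None:
    """Snapshot a VM, apply an update inside it, and revert if anything fails."""
    snapshot = "pre-update"
    run(["virsh", "snapshot-create-as", domain, snapshot])
    try:
        apply_update(domain)  # e.g. push the vendor update and run health checks
    except Exception as exc:
        print(f"update failed ({exc}), reverting {domain}", file=sys.stderr)
        run(["virsh", "snapshot-revert", domain, snapshot])
        raise
    else:
        # Keep the snapshot until the change has soaked, then clean it up.
        run(["virsh", "snapshot-delete", domain, snapshot])
```

On bare metal, the closest equivalents are filesystem-level snapshots (ZFS, Btrfs, LVM) or the A/B image approach sketched above, which is why it is harder to retrofit.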
Unfortunately, in the enterprise world there are strict legal requirements for fast-updating protection against emerging threats. The update that went bad for Crowdstrike was specifically targeted at a new type of activity actively being used in campaigns against large enterprises (named pipes). The legal and technical needs that Crowdstrike fills aren’t going away just because they had this major issue. I’m saying this as a high-level network engineer for a very large enterprise which recently had a ransomware event and afterwards determined that, of all the possible options, only Crowdstrike was likely to have prevented it (besides user education, but we already have an entire department strictly for user security education, and this event happened in an acquisition that hadn’t yet been trained to think like part of a large enterprise). I’m sure some customers will move to other security vendors due to the incident, which dilutes the effect slightly, but for many there still won’t be a better option than to continue using Crowdstrike.
I worked at a large financial services company before I retired. I agree that no production system should be set up to automatically install untested updates. Our process was to always take the update and run it against a test system to ensure that it didn’t corrupt the system and continued to work as designed. After that initial testing, we’d roll it out to a select few production systems (and unless the severity of the security incident was critical, we’d run it on those select production systems for a while before proceeding to roll it out around the world). We had hot backup systems and would only install updates on the production side, allowing us to fail over to the un-updated backup system should we identify an issue. But the issues of last week speak to an even bigger problem: these companies seem to lack the ability to fail over. This is why companies like CDK take weeks to recover from ransomware attacks.
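As a rough illustration of that kind of ring-based rollout, here is a sketch of the policy, not of any specific product; the ring names, soak times, and health check are invented for the example:

```python
import time

# Hypothetical rollout rings, from least to most risky.
RINGS = [
    {"name": "test-lab", "soak_hours": 24},
    {"name": "canary-production", "soak_hours": 72},
    {"name": "global-production", "soak_hours": 0},
]


def healthy(ring_name: str) -> bool:
    """Placeholder health check: in practice, query monitoring/telemetry."""
    return True


def deploy(ring_name: str, update_id: str) -> None:
    """Placeholder deploy step: push the vendor update to one ring only."""
    print(f"deploying {update_id} to {ring_name}")


def staged_rollout(update_id: str, critical: bool = False) -> None:
    """Promote an update ring by ring, halting (and leaving the hot backup
    untouched) the moment a ring looks unhealthy."""
    for ring in RINGS:
        deploy(ring["name"], update_id)
        # Critical security fixes may shorten the soak, but never skip the canary.
        soak = 1 if critical else ring["soak_hours"]
        time.sleep(soak * 3600)  # in reality a scheduled job, not a literal sleep
        if not healthy(ring["name"]):
            print(f"halting rollout of {update_id}: {ring['name']} unhealthy")
            return
    print(f"{update_id} fully rolled out")
```

The key property is that nothing reaches the global ring, or the backup side, until the earlier rings have soaked and passed their health checks.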
Security in depth is key.