Many vendors recommend automatic updates, considering them essential for safeguarding against security threats and maintaining system performance. Updates not only enhance security but also deliver bug fixes and new features, contributing to an improved user experience. Software updates, however, come with the inherent risk of breaking existing functionality and can interfere with other software or the operating system itself, causing unintended side effects. Automatic updates that the user has no control over increase this risk further.
The CrowdStrike incident
Last Friday, July 19, 2024, an automatic update that the American cybersecurity company CrowdStrike published for its Falcon Sensor endpoint protection product caused a worldwide outage, disrupting governments and businesses in many industries. The update triggered a bug in the Windows kernel driver of the Falcon Sensor, resulting in a "blue screen of death" that left devices in a continuously crashing state and required a manual remedy on each affected machine.
Recovery efforts are still ongoing, and the incident is estimated to cause billions of dollars in damage. The following problems with the CrowdStrike software together led to one of the biggest IT outages in history:
- Use of a kernel driver
A Windows kernel driver has the same permissions and access to the machine as the operating system itself. A bug in a kernel driver most often causes the infamous blue screen of death, and when the crash recurs on every boot, as it did here, it can only be fixed by booting into safe mode or a recovery environment.
- Privileged code using external content
The bug was triggered by a content file that was pushed to the endpoints. Such "configuration" content should never be able to trigger bugs in code running in privileged mode, i.e., kernel space; privileged code must validate external input before acting on it.
- Automated updates
IT teams and users have limited control over automated updates because these are pushed directly to production systems, with no opportunity to test them before they are rolled out.
- Insufficient testing
Following secure software development practices and having proper quality assurance processes can significantly reduce the chances of releasing defective software. Testing methodologies and development practices such as unit, integration, and end-to-end tests, fuzz testing, code coverage measurement, static code analysis, and even formal verification can detect problems early during development.
Due to what happened, there are now jokes on the internet that rank "software updates" alongside "phishing" and "malware" among the biggest IT security risks.
No software is free of bugs and all people make mistakes. However, with a well-designed software architecture and secure software development practices, the risks of such incidents can be significantly reduced.
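To make the "privileged code using external content" point more concrete, here is a minimal Python sketch of defensive content loading. Everything in it is an assumption made for illustration: the file layout (`MAGIC`, `HEADER`, `RECORD`), the `MAX_FIELDS` limit, and the function names are hypothetical and do not describe CrowdStrike's or NXLog's actual formats or code. The idea is simply that a consumer checks every length and index before using it and keeps the last known-good content when validation fails.

```python
import struct

# Hypothetical content-file layout, invented for this example:
#   header: 4-byte signature + 16-bit record count (little-endian)
#   record: 16-bit rule id + 16-bit index into a fixed table of fields
MAGIC = b"RULE"
HEADER = struct.Struct("<4sH")
RECORD = struct.Struct("<HH")
MAX_FIELDS = 20  # hypothetical number of fields the consumer actually provides


class InvalidContent(ValueError):
    """Raised when a content file fails validation."""


def parse_content(blob: bytes) -> list[tuple[int, int]]:
    """Parse a content file, rejecting anything malformed or out of range."""
    if len(blob) < HEADER.size:
        raise InvalidContent("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidContent("unexpected signature")
    # The declared record count must match the real payload size exactly.
    if len(blob) != HEADER.size + count * RECORD.size:
        raise InvalidContent("record count does not match file size")
    rules = []
    for i in range(count):
        offset = HEADER.size + i * RECORD.size
        rule_id, field_index = RECORD.unpack_from(blob, offset)
        # Never trust an index from the file without a bounds check.
        if field_index >= MAX_FIELDS:
            raise InvalidContent(f"field index {field_index} out of range")
        rules.append((rule_id, field_index))
    return rules


def apply_update(blob: bytes, current: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the new rules if the update is valid, else keep the current ones."""
    try:
        return parse_content(blob)
    except InvalidContent:
        return current
```

With this shape, a malformed update degrades to "no change" instead of a crash. In real privileged code the same checks would be written in the driver's own language; Python is used here only to keep the sketch short.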
How NXLog works
Like endpoint security software, our NXLog Enterprise Edition product is often deployed to endpoints, where it is used as a log collector agent. We would like to highlight a few things about how the NXLog Enterprise Edition agent works and how NXLog, our company, operates in order to eliminate the risk of similar high-impact failures.
- No kernel drivers
NXLog Enterprise Edition does not contain or install any custom-developed kernel drivers on any of the supported operating systems. This ensures that its code is never executed at the same privilege level as the OS kernel.
- Service running in user-space
The service runs in user-space to reduce the risk and impact of potential issues. In case of an issue such as high CPU usage, a service crash, or a compatibility problem with another software component, the service can be stopped and uninstalled, even remotely, through standard management tools, because a failure in user-space software should not block access to the host operating system.
- Principle of least privilege
NXLog software was designed with the principle of least privilege in mind. On Linux, macOS, and Unix systems, the service runs under a non-root account by default, retaining only the privileges it needs. On Windows, our Hardening NXLog guide provides details on how to configure the NXLog agent to run under a regular non-system account.
- Standard APIs
Some log types need to be collected directly from the operating system. To interact with the OS, the NXLog Enterprise Edition agent uses standard kernel APIs that are well tested by the operating system vendor. This provides compatibility without the risk of using kernel drivers.
- Agent-less log collection mode
NXLog Enterprise Edition can be configured to collect event log data remotely in an agent-less mode, without being installed on the endpoints, by utilizing the log forwarding capabilities of the host operating system. While the agent-based mode provides more flexibility and access to more log sources, the agent-less mode eliminates the potential for the agent to cause problems or interference on the host operating system generating the logs.
- No automatic software updates
NXLog Enterprise Edition has no embedded auto-update mechanism, and NXLog cannot initiate software updates automatically. We always recommend that our customers and users rely on traditional software update tools and mechanisms. This allows rigorous update cycles and testing processes to be implemented, giving you full control over how and when updates are rolled out using your own automation tools. Remote configuration updates through NXLog Manager and NXLog Platform are also user-controlled and give you full visibility into which configuration changes will be pushed to the agents.
- Quality assurance
Our release process includes a rigorous quality assurance stage to ensure that functional, stability, and performance requirements are met and that no major or critical bugs are included in the release. Our automated testing pipelines execute hundreds of unit, functional, end-to-end, fuzz, performance, stability, upgrade/downgrade, code analysis, memory leak, and concurrency checks across the dozens of platforms and CPU architectures that NXLog Enterprise Edition supports.
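The user-controlled rollout described under "No automatic software updates" can take many forms. As a rough illustration only, the following Python sketch shows a ring-based rollout in which an update reaches test hosts first and stops spreading as soon as any host fails a health check. The names `staged_rollout`, `deploy`, `is_healthy`, and the host names are hypothetical placeholders for whatever your own automation tooling provides; none of them are part of an NXLog product or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    updated: list[str] = field(default_factory=list)  # hosts that received the update
    failed: list[str] = field(default_factory=list)   # hosts that failed the health check

    @property
    def completed(self) -> bool:
        return not self.failed


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> RolloutResult:
    """Update hosts ring by ring, halting after the first ring with a failure."""
    result = RolloutResult()
    for ring in rings:
        for host in ring:
            deploy(host)
            result.updated.append(host)
        # Check the whole ring before moving on to the next one.
        result.failed = [host for host in ring if not is_healthy(host)]
        if result.failed:
            break  # later rings never receive the update
    return result


if __name__ == "__main__":
    # Stand-ins for real tooling: record deployments, simulate one bad host.
    deployed: list[str] = []
    rings = [["test-1"], ["prod-1", "prod-2"], ["prod-3"]]
    outcome = staged_rollout(rings, deployed.append, lambda h: h != "prod-2")
    print("updated:", outcome.updated)
    print("failed:", outcome.failed)
    print("completed:", outcome.completed)
```

In this example the simulated failure on `prod-2` stops the rollout before `prod-3` is touched, which is the property that limits the impact of a bad update. A production version would typically also wait between rings and roll back the failed ring.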
To plan an upgrade or test new features, you can always refer to the change logs and release notes published on our website, which include all known issues.
Feel free to reach out to our team if you have questions or need assistance.