How does your organisation handle patching from a security and functionality standpoint? There are differing views on how this should work, depending on numerous factors, but the two most common by far are access to the latest features from the functionality perspective, and attitude to risk from the security angle. So which of these routes is the right one to take?
A lot of this depends on the industry sector and seems to vary widely from one institution to another. In this article, we will analyse the difference between the two views and decide which process to adopt. The important thing to remember is that there is no hard and fast rule or “accepted standard” – the only requirement is that whichever route you take is applied consistently across the board. Let’s look at the two key areas and assess the pros and cons of each scenario.
Predefined patching timelines
The most commonly adopted standard is to perform patching quarterly. In today’s landscape, this is a poor choice from a security standpoint: it means that for up to three months, your systems are running without the latest security updates. Bearing in mind that Microsoft typically releases patches on the second Tuesday of each month (often referred to as “Patch Tuesday”, with exceptions for critical security updates, which are released as soon as they are available), this is a long time to wait and would expose your environment to unnecessary (and mostly avoidable) risk. The upside of such an approach is that it gives the organisation sufficient time to test each approved patch against the relevant applications to ensure functionality before proceeding into live. This mechanism is very effective, but also attracts significant overhead in terms of maintaining an up-to-date environment to perform patching against. It also requires users who are willing to test the application and perform regression testing to ensure functionality is not negatively impacted.
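To put a number on that exposure window, here’s a minimal Python sketch (illustrative only – the function names are my own) that works out Patch Tuesday for a given month and counts the days a fix sits unapplied under a quarterly cycle:

```python
from datetime import date, timedelta

def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month (Microsoft's usual release day)."""
    first = date(year, month, 1)
    # Days until the first Tuesday of the month (weekday() == 1 is Tuesday),
    # then add a further week to land on the second Tuesday.
    offset = (1 - first.weekday()) % 7
    return first + timedelta(days=offset + 7)

def exposure_days(released: date, patched: date) -> int:
    """Days a released fix sits unapplied before the next patching run."""
    return (patched - released).days
```

Under a quarterly cycle, a fix released on the January Patch Tuesday could sit unapplied for the best part of three months before the end-of-quarter run.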
Admittedly, if you are using virtual machines within your production environment, it is perfectly feasible to clone these images and copy them into a ring-fenced environment. Much of this depends on the disparate nature of some networks; latency issues alone could render this process unworkable. Those leveraging high-speed connectivity in the form of a local LAN needn’t be concerned on this point. Once the environment is ready, the patches can be applied and UAT performed to gain approval to implement in production. One sticking point here: you are going to need some form of patching utility that can bulk-approve updates and then replicate this into the production network. There are a variety of tools that can accomplish this with little effort. For the sake of this article, we’ll stick to WSUS.
The great thing about this platform (despite its various quirks and nuances) is that a WSUS server can be either upstream or downstream, depending on your needs. If your organisation needs to follow SOX or ITIL based controls, then this arrangement is ideal. In this case, patches are approved in the test network, which is the upstream. The production network is downstream, meaning it cannot make its own decisions about which patches to install, but relies on the “master” for a copy of its catalogue. In essence, the downstream server runs as a replica: no local approval changes are accepted until this setting is removed. This negates the possibility of an unapproved patch making its way into the production environment without the required functionality testing evidence – provided the set of updates approved in test matches those in UAT and production.
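That final matching check is easy to automate. A hypothetical Python sketch (WSUS itself would be queried via its own API or PowerShell – the sets here simply stand in for exported lists of approved update IDs):

```python
def reconcile(test: set[str], uat: set[str], prod: set[str]) -> dict[str, set[str]]:
    """Compare approved-update IDs (e.g. KB numbers) across the three rings.
    Anything missing downstream, or approved downstream without an upstream
    counterpart, should be investigated before sign-off."""
    return {
        "missing_in_uat": test - uat,
        "missing_in_prod": uat - prod,
        "unapproved_in_prod": prod - test,
    }
```

Any non-empty set in the result means the rings have drifted and the evidence trail is incomplete.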
The fundamental issue with this approach is the manpower and effort required both to adhere to the controls and to ensure timely completion of all UAT before proceeding into production. Should the process take longer than expected, you could quickly find yourself either falling behind or overlapping into the following month. This means either a very short gap between patching cycles (with further UAT to accommodate delta changes), or an agreed cut-off point during the month to ensure timely completion of the process.
Much of this really depends on staffing levels both in IT and the associated user requirement for testing in each department that uses any particular application.
For control purposes, the above patching process is ideal, but can very quickly become akin to painting the Forth Bridge 🙂
Admittedly, this analogy is somewhat colloquial, but you’d soon appreciate the irony if you found yourself in this situation.
Rolling patching scenario
Now let’s look at an alternative model. What if your patching process was a rolling one, with automated approvals for all security and critical updates, auto-deployment to servers and workstations, and most importantly, acceptance from the business that security took preference over functionality? Sounds great, yes? You’d get your weekend back, and could use that time instead to play golf…
Right – back to reality with a resounding thump. For any business that has controlled processes in place requiring test, UAT, and production testing evidence for an audited application, this just isn’t feasible – let alone plausible. But what’s the risk? Microsoft rarely [sic] introduces patches that break functionality (I hope the sarcasm isn’t lost here) of servers or cause inadvertent slowness and latency in an application, but has been known on numerous occasions to cause issues with products like Citrix, or anything else that relies on the underlying TCP/IP transport stack to communicate with network APIs effectively. Based on this, the rolling mechanism just wouldn’t work – can you imagine your users attempting to access an application only to find it doesn’t work correctly, hangs whilst retrieving data, or crashes out with a virtually useless error message that means nothing to man or beast?
The final insult would be having to find a fix for the related issue in production, meaning frustrated users who frequently make you the target of their anger, and no doubt senior management breathing down your neck waiting for an update on remediation. In most cases, you’d probably find yourself having to roll back in order to restore service, and then resorting to some form of test environment to replicate and resolve the problem. In any case, this scenario is less than ideal and very damaging from a reputation and confidence perspective.
Given the results of this second approach, you’re probably thinking that the additional cleanup and associated headache just isn’t worth the shortcut. In some ways, you’d be right – it just wouldn’t be feasible to conduct operations in this fashion and expect the business to encounter the same problems every month, with the same headless chickens running around seeking a resolution. However, there is a compromise.
What could (and does) work ?
As I alluded to earlier, there is no hard and fast patching standard that will suit every IT department and business need, but it would be perfectly acceptable to consolidate the two most popular methodologies into something that allows for flexibility. For this to work effectively, you’d need to assess your estate on the basis that not every system is mission-critical. For example, a broken print server isn’t the end of the world, and neither is a single faulty domain controller or some file servers. The best approach is to split the servers that require patching into two broad groups – infrastructure and applications.
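As a rough illustration of that split, here is a sketch that buckets an inventory by whether any hosted role is user-facing. The role names are invented for the example – substitute whatever your CMDB actually records:

```python
# Roles assumed to be user-facing for this illustration; adjust to your estate.
APP_ROLES = {"citrix", "sql", "file-share", "web-app"}

def split_estate(inventory: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Split servers into (infrastructure, applications) depending on whether
    any role they host is one that users access directly."""
    infra, apps = [], []
    for server, roles in inventory.items():
        (apps if roles & APP_ROLES else infra).append(server)
    return sorted(infra), sorted(apps)
```

The infrastructure list can then follow a faster, lighter-touch cycle, while the applications list keeps the full test/UAT treatment.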
The infrastructure portion should comprise servers that do not host applications that users access directly. Despite the various nuances around Microsoft and their testing of patches before release, I have personally seen only a handful of occurrences in my career where a particular patch meant a system would not start, or made the login process take an extended amount of time to complete. There is usually an underlying issue that manifests itself – often a conflict within the system you’ve patched, or a problem with another part of the system that only comes to light as a result of a recent system file or DLL change. Realistically, Microsoft isn’t in a position to perform patch testing for everyone, and updates are released having passed reasonably rigorous testing based around common configurations and scenarios. I’m no Microsoft advocate, but I certainly appreciate their stance on patch releases.
That leaves applications – and unless you’re running an application or a file service on a domain controller (which in itself is bad design), you should have a reasonable split. Similarly, there are common applications used by businesses that are not considered mission-critical. You could patch these and have a user perform testing to confirm functionality. Another way of deploying patches is to create a snapshot at the virtual machine level and patch against that. If it all goes horribly wrong, you simply revert to the snapshot (there is a caveat with this approach on domain controllers depending on the Windows version – see note below*).
* Snapshots of domain controllers on Windows Server versions prior to 2012 are definitely a no-go. With Server 2012, Microsoft added functionality (the VM-Generation ID) to detect a snapshot revert. However, I haven’t tried this myself yet.
If this is the only DC in your domain, it may (no guarantees) work to revert it to the snapshot. With other DCs in the domain, the way to go would be either a fresh installation/promotion, or restoring an AD backup from another DC to the DC in question after reverting to the snapshot and immediately booting into AD restore mode.
Attribution and original thread can be found here
However, this can prove to be a double-edged sword. If you are using VMware and forget to consolidate the snapshot after testing approval, you could find yourself out of disk space very quickly. In many cases, running on a snapshot for too long can also cause performance issues and, at worst, render the machine inoperable should the snapshot become corrupted. If you take this approach, then why would any user need to test something they have no direct access to? The ancillary services that infrastructure servers provide are essential, yes, but it is rare for problems to arise that force your entire network offline as a result of a Windows patch.
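A simple guard against forgotten snapshots is a scheduled check that flags anything older than the agreed UAT window. A library-agnostic sketch follows – in practice you’d pull the snapshot list from vCenter (e.g. via pyVmomi), and the three-day threshold is purely illustrative:

```python
from datetime import datetime, timedelta

MAX_SNAPSHOT_AGE = timedelta(days=3)  # illustrative threshold; match your UAT window

def stale_snapshots(snapshots: list[tuple[str, datetime]], now: datetime) -> list[str]:
    """Return the names of VMs whose pre-patch snapshot has outlived the UAT
    window and should be consolidated before it eats the datastore."""
    return [name for name, taken in snapshots if now - taken > MAX_SNAPSHOT_AGE]
```

Run it daily and raise a ticket for anything it returns, and the out-of-disk-space scenario above becomes much harder to stumble into.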
Do you leverage time zone differences for patching?
For example, if your business operates in multiple locations that are geographically dispersed, this is a good opportunity to patch a system remotely when nobody is connected to or using it. My preference here is always to have an alternative method of accessing a server that isn’t based in the local site, so if anything does go wrong, you have another way in. Typically, an iDRAC or equivalent is suitable for this purpose and should always be considered – it can save a lot of time and headaches if you do end up in the unfortunate position of having a server that will not boot – and it’s 5,000 miles away. At that point, you either leverage a local support provider in that region, or walk a user through connecting a mouse, keyboard, and monitor to the server in order to gain an insight into what has happened.
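Scheduling around remote sites is easy to get wrong by hand. A small sketch using Python’s zoneinfo module shows the idea – the 01:00–05:00 window is an assumption, so substitute whatever maintenance window you’ve agreed with each site:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def in_maintenance_window(utc_now: datetime, site_tz: str,
                          start_hour: int = 1, end_hour: int = 5) -> bool:
    """True if the current UTC time falls within the off-hours window
    (default 01:00-05:00 local time) at the remote site, so patching
    won't collide with working users there."""
    local = utc_now.astimezone(ZoneInfo(site_tz))
    return start_hour <= local.hour < end_hour
```

An orchestration job can loop over its site list with a check like this and only release patches to the sites currently asleep.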
How do you handle your patching?