After the fallout of the WannaCry attack, which infected and effectively disabled machines in some 150 countries around the world, one question rises to the surface of every security think tank.
Is it possible or practical to utilise big data and machine learning to accurately predict the next cyber attack?
Given the extent of available intelligence and information around previous campaigns, you’d think the security industry would be better placed to identify where the next threat will originate. Whilst some are of the opinion that the ability to predict exists, others do not share this view. Here’s my take on the situation.
Can big data accurately predict the next attack?
A common misconception is that the next attack can be predicted using analysis and data collected from previous campaigns, or AI that analysts can use to study active traffic streams. I’m not dismissing big data analytics – I believe it can only add to existing information and bolster it. Additionally, organisations such as Norse and FireEye provide threat maps that show attacks in real time. With that in mind, though, how often do you hear of a local law enforcement agency battering down the door of a suspect in a dawn raid for a cyber attack that hasn’t even happened yet? You could argue that similar techniques are used to snare terrorists before they actually engage in an act, although again, authorities and law enforcement are responding to intelligence and surveillance rather than pinpointing a future event that has not yet taken place. Nobody can accurately predict the exact date, time, and location of the next cyber attack. You could have information sources (rather like a “digital supergrass“), but this could not be classed as prediction – it’s intelligence and privileged information derived from within the walls of the criminal underworld.
Here’s a suggested paradigm: you can think of the ability to determine the exact date and time of a cyber attack in the same way as attempting to predict the lottery numbers. The lottery is based on random number generation using sophisticated and complex algorithms, and there is no surefire way to predict what the next set of numbers will be – even with advanced machine learning and big data analytics. The data available would be a sample or subset of what has already taken place, not an indication of a future event that could be used to drive advance decision making (unless you’re watching Terminator Genisys).
The complex detection algorithms that could be created in an effort to sift key information from big data and determine the location of the next attack remind me of the Willy Wonka sketch in which Tim Brooke-Taylor attempts to use the ’70s equivalent of big data analysis to predict where the next golden ticket will surface.
Humorous? Yes. A practical response in today’s technology sphere? No – not really. The point I’m trying to make here is that cyber criminals change tactics on a regular basis, often making full use of encryption and underground networks in order to evade detection. As these networks remain difficult to gain access to or navigate, a significant chunk of useful information from big data inevitably drops off the map. To make matters worse, there are always “unknowns”, such as the NSA-crafted (and subsequently leaked) tools known as ETERNALBLUE and DOUBLEPULSAR, that add to the complexity of extracting any useful information or intelligence from what is realistically exabytes of data.
There is an interesting discussion around how machine learning and big data can be leveraged as a mechanism to predict the next attack. However, this relies on real-time activity, and is more aligned to behavioural pattern or trend analysis. Once you analyse the traffic stream, you could be in a position to identify the payload (provided the conversation isn’t encrypted), then use intelligence already collated on the source address to determine whether the stream is legitimate or malicious. One of the many inevitable questions surrounding the use of machine learning and big data to detect attacks in real time is privacy. This data may be privileged, or of a sensitive nature, and the owner may not permit access by a third party for various security, compliance, and legal reasons. Applications such as Splunk emerged as SIEM leaders in the 2015 Gartner Magic Quadrant owing to the platform’s ability to process massive amounts of data (mainly logs collected from multiple sources), normalise and compress it, then store it in indexed files that allow the retrieval of human-readable information. Since then, various platforms have emerged offering the same capabilities.
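To make the idea concrete, here is a minimal sketch of source-address reputation checking on a traffic stream. Everything in it is illustrative: the “threat feed” uses RFC 5737 documentation ranges as stand-ins for real intelligence, and `classify_flow` and the toy signature are hypothetical names, not any vendor’s API. Note how an encrypted conversation (no inspectable payload) forces the classifier back onto source reputation alone – exactly the limitation described above.

```python
import ipaddress
from typing import Optional

# Hypothetical threat-intelligence feed: networks previously observed
# launching malicious traffic. These are documentation ranges, used
# here purely as placeholders for real intelligence data.
THREAT_FEED = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def classify_flow(src_ip: str, payload: Optional[bytes]) -> str:
    """Classify one flow as 'malicious', 'suspicious' or 'legitimate'."""
    src = ipaddress.ip_address(src_ip)
    # Source-address intelligence: known-bad network wins outright.
    if any(src in net for net in THREAT_FEED):
        return "malicious"
    # Encrypted stream: payload inspection is impossible, so the best
    # we can say without reputation data is "suspicious".
    if payload is None:
        return "suspicious"
    # Toy signature match on a cleartext payload.
    if b"cmd.exe" in payload:
        return "malicious"
    return "legitimate"

print(classify_flow("198.51.100.23", None))            # malicious
print(classify_flow("192.0.2.10", None))               # suspicious
print(classify_flow("192.0.2.10", b"GET / HTTP/1.1"))  # legitimate
```

The design point is that the decision is reactive: every branch consumes intelligence already collated about the past, which is behavioural analysis, not prediction.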
Can intelligence alone predict an attack?
Previously gathered intelligence plays a pivotal role in determining the source and destination of an attack. However, as an analyst or an expert, you need to be aware that the information gathered, once assimilated, only provides an insight into the type of attack conducted, its origin, the target, and the impact. It doesn’t necessarily serve as a mechanism to predict the next attack. In fact, the data concerning the source and destination only becomes of any use once all the evidence is examined and dissected – after the event.
To understand why this is, we should look at how attacks work. Even the most basic of campaigns needs a source, destination, and objective before launch. Additionally, an attack can originate from literally anywhere in the world (either by real geographical location or by proxy). A similar thought pattern can be applied to DDoS attacks. By definition, these are distributed over numerous geographical locations (hence multiple sources), and would be extremely difficult to predict in advance. Regardless of informative research and analysis, there is still no solid platform or foundation on which to predict where the source will originate. Admittedly, there is sufficient data to suspect a region or entity of an attack based on prior activity, or to consider an organisation at risk of an attack, but unless you have inside information, how can you realistically tell when and by whom?
Does migrating to IPv6 have an impact on prediction or attack success?
Cyber criminals are becoming increasingly adept at making life difficult for security teams and researchers to firmly establish a correlation between emerging traffic patterns and their exact locations. Much the same can be said of the target. As an example, of the 4,294,967,296 (32-bit, 2^32) IPv4 addresses in total, 588,514,304 are reserved (private network ranges, multicast, and other special-purpose blocks). Simple maths yields an incredible 3,706,452,992 public addresses that can be assigned and are capable of routing public traffic – and yet IPv4 is nearly exhausted: fewer than 4 billion routable addresses for a global population of more than 7 billion makes the deficit obvious. You’d have to agree this is still a staggering amount, and based on this volume, how could you possibly predict a source or target in advance?
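The IPv4 arithmetic above is simple enough to check directly (the reserved-address figure is the one quoted in the text):

```python
# Reproduce the IPv4 address arithmetic quoted above.
total_ipv4 = 2 ** 32               # 4,294,967,296 addresses in the whole space
reserved_ipv4 = 588_514_304        # reserved/private figure cited in the text
public_ipv4 = total_ipv4 - reserved_ipv4

print(f"{public_ipv4:,}")          # 3,706,452,992 assignable public addresses
```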
The problem of being unable to predict where the next attack will surface is made much worse by the pending adoption of IPv6. This address space is 128 bits (2^128) in size, containing an eye-watering number of addresses – in fact 340,282,366,920,938,463,463,374,607,431,768,211,456. However, not all of these would be public. Routable IPv6 internet addresses need to start with a first hextet of 2xxx or 3xxx (the 2000::/3 global unicast range), but that still leaves roughly 42 undecillion addresses – essentially a mind-blowing, publicly available 42,000,000,000,000,000,000,000,000,000,000,000,000.
If an attacker were to perform a reconnaissance scan of an IPv6 subnet, this could take an astonishing amount of time to complete owing to the size of the range. Most organisations are likely to reduce the range where possible for efficiency purposes, but even a single /64 contains 2^64 hosts: an attacker with a scanner that can process a million addresses per second would still need well over half a million years to sweep it. Not very appealing or a productive use of time for anyone with a specific end goal in mind.
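A quick sanity check of that scan-time arithmetic, assuming a scanner sustaining one million probes per second against a single /64:

```python
# Time to sweep one IPv6 /64 at an assumed million probes per second.
probes_per_second = 1_000_000
hosts_in_slash_64 = 2 ** 64            # 18,446,744,073,709,551,616 hosts
seconds = hosts_in_slash_64 / probes_per_second
years = seconds / (365.25 * 24 * 3600)
print(f"{years:,.0f} years")           # 584,542 years
```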
Prediction versus prevention – which route should you choose?
Attacks can be detected using a variety of intelligence, historical information, and analytical processes, but only once an attack is in progress. Early detection is obviously key, but attacks tend to appear out of nowhere – often starting out as low-level reconnaissance probes before the campaign itself is launched. Even at this early stage, however, these probes are enough of an indicator to trigger warning systems such as an IDS, and issue alerts to security administrators. Simple reconnaissance involving port or service scans also tends to raise threat levels on security appliances, and this early detection is what really matters in the event of an attack. It’s not the ability to predict but the ability to prevent that is of the utmost importance and value. If you invested all your time and effort in machine learning and big data in an effort to predict the next cyber attack, your focus would inevitably move away from protection towards an analytical stance – not necessarily the best position should an attack eventually manifest itself.
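The early-detection idea can be sketched in a few lines: flag any source that probes an unusually large number of distinct ports within one observation window. The threshold, event format, and function name here are illustrative assumptions, not any IDS’s actual implementation.

```python
from collections import defaultdict

SCAN_THRESHOLD = 10  # distinct ports from one source before an alert fires

def detect_scans(events):
    """events: iterable of (src_ip, dst_port) pairs seen in one time window."""
    ports_by_src = defaultdict(set)
    for src, port in events:
        ports_by_src[src].add(port)
    # Sources probing many distinct ports look like reconnaissance scans.
    return {src for src, ports in ports_by_src.items()
            if len(ports) >= SCAN_THRESHOLD}

window = [("203.0.113.5", p) for p in range(20, 35)]  # sequential probe, 15 ports
window += [("192.0.2.8", 80), ("192.0.2.8", 443)]     # ordinary web traffic
print(detect_scans(window))  # {'203.0.113.5'}
```

Again, nothing here predicts an attack in advance – the alert fires only because the reconnaissance has already begun, which is precisely the prevention-over-prediction point.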
So is trying to predict the next cyber attack a waste of time? Not at all. Time and effort should be invested in analysing big data, with decisions made only on the basis of previous events and established, factual activity – something you can’t realistically predict until it has happened. If you consider the probability that the WannaCry attack was launched by a small group rather than a highly organised gang, what are your chances of stopping it before it even happens? Intelligence is only as good as the information source that provides it. If it’s not there to start with, how can it be predicted?
What’s your view?